Customer Segmentation in Python

Customer segmentation is a powerful analytics technique for grouping customers so that businesses can tailor their product offerings and marketing strategies to each group.

For example, we can group customers by the month of their first purchase, segment them by their recency, frequency and monetary (RFM) values, or use a clustering model such as k-means to identify groups of customers with similar purchasing behavior.

Introduction to cohort analysis

Cohort analysis is a subset of behavioral analytics that takes the data from a given data set (e.g., an e-commerce platform, web application, or online game) and, rather than looking at all users as one unit, breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences within a defined time span. Cohort analysis allows a company to “see patterns clearly across the life-cycle of a customer (or user), rather than slicing across all customers blindly without accounting for the natural cycle that a customer undergoes.” By seeing these patterns over time, a company can adapt and tailor its service to those specific cohorts. While cohort analysis is sometimes associated with a cohort study, they are different and should not be viewed as one and the same.

  • Mutually exclusive segments - cohorts - which are then measured over time
  • Compare metrics across product lifecycle
  • Compare metrics across customer lifecycle

Types of cohorts

  • Time cohorts: Customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows customer behavior depending on when they started using the company's products or services. The time frame could be monthly, quarterly or even daily.
  • Behavior cohorts: Customers grouped by the type of product they purchased or the tier of service they signed up for in the past. Customers who sign up for different levels of service may have different needs; understanding those needs can help companies develop customized products or services for particular segments.
  • Size cohorts: Customers grouped by how much they spend with the company. Size can be based on the amount spent in some time period after acquisition, or on the product category the customer spent the most money on in a given period.

Elements of cohort analysis

  • Typically pivot table format
  • Assigned cohort in rows
  • Cohort Index in columns
  • Metrics (e.g. active customer counts) in the table cells - a toy sketch follows below
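
To make this layout concrete, here is a minimal, hand-made sketch of such a table (the numbers are invented for illustration, not taken from the retail data used below):

import pandas as pd

# Toy cohort table: acquisition cohorts in rows, cohort index (periods since
# acquisition) in columns, and a metric (active customer counts) in the cells
toy_cohort_counts = pd.DataFrame(
    {1: [100, 80, 90],
     2: [40, 30, None],
     3: [25, None, None]},
    index=pd.to_datetime(['2011-01-01', '2011-02-01', '2011-03-01']))
toy_cohort_counts.index.name = 'CohortMonth'
toy_cohort_counts.columns.name = 'CohortIndex'
print(toy_cohort_counts)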

Online retail data

The dataset contains over $0.5$ million transactions from a UK-based online retail store. We will work with a randomly sampled $20\%$ subset of it.
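
For reference, a subset like this could be drawn from the full transaction file with pandas; the file name and random seed below are hypothetical, shown only to illustrate the sampling step:

import pandas as pd

# Draw a reproducible 20% random sample of the full transaction data
# ('full_online_retail.csv' is a hypothetical file name)
full = pd.read_csv('full_online_retail.csv')
full.sample(frac=0.2, random_state=42).to_csv('online.csv')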

In [1]:
import pandas as pd
import numpy as np 
import datetime as dt

online = pd.read_csv('online.csv', index_col='Unnamed: 0')
online.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70864 entries, 416792 to 312243
Data columns (total 8 columns):
InvoiceNo      70864 non-null int64
StockCode      70864 non-null object
Description    70864 non-null object
Quantity       70864 non-null int64
InvoiceDate    70864 non-null object
UnitPrice      70864 non-null float64
CustomerID     70864 non-null int64
Country        70864 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 4.9+ MB
In [2]:
online.InvoiceDate = pd.to_datetime(online.InvoiceDate)
online.dtypes
Out[2]:
InvoiceNo               int64
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID              int64
Country                object
dtype: object

Assign daily acquisition cohort

Defining a cohort is the first step of cohort analysis. We will now create daily cohorts based on the day each customer made their first transaction.

In [3]:
# Define a function that will parse the date
def get_day(x): 
    return dt.datetime(x.year, x.month, x.day) 
In [4]:
# Create InvoiceDay column
online['InvoiceDay'] = online['InvoiceDate'].apply(get_day) 

# Group by CustomerID and select the InvoiceDay value
grouping = online.groupby('CustomerID')['InvoiceDay'] 

# Assign the minimum InvoiceDay value per customer as their CohortDay
online['CohortDay'] = grouping.transform('min')

# View the top 5 rows
online.head()
Out[4]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceDay CohortDay
416792 572558 22745 POPPY'S PLAYHOUSE BEDROOM 6 2011-10-25 08:26:00 2.10 14286 United Kingdom 2011-10-25 2011-04-11
482904 577485 23196 VINTAGE LEAF MAGNETIC NOTEPAD 1 2011-11-20 11:56:00 1.45 16360 United Kingdom 2011-11-20 2011-09-12
263743 560034 23299 FOOD COVER WITH BEADS SET 2 6 2011-07-14 13:35:00 3.75 13933 United Kingdom 2011-07-14 2011-07-14
495549 578307 72349B SET/6 PURPLE BUTTERFLY T-LIGHTS 1 2011-11-23 15:53:00 2.10 17290 United Kingdom 2011-11-23 2011-11-23
204384 554656 21756 BATH BUILDING BLOCK WORD 3 2011-05-25 13:36:00 5.95 17663 United Kingdom 2011-05-25 2011-02-25

Now each customer belongs to a daily acquisition cohort that we can use for further analysis!
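
As a side note, the same day-level truncation can be done without apply(), using the vectorized .dt accessor (assuming a reasonably recent pandas version); this is an equivalent alternative to the get_day() helper above, not a different calculation:

# Vectorized alternative: truncate each timestamp to day precision
online['InvoiceDay'] = online['InvoiceDate'].dt.floor('D')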

Calculate time offset in days - part 1

Calculating time offset for each transaction allows us to report the metrics for each cohort in a comparable fashion.

First, we will create $6$ variables that capture the integer values of the year, month and day for the invoice and cohort dates, using a get_date_int() helper function.

In [5]:
def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day
In [6]:
# Get the integers for date parts from the InvoiceDay column
invoice_year, invoice_month, invoice_day = get_date_int(online, 'InvoiceDay')

# Get the integers for date parts from the CohortDay column
cohort_year, cohort_month, cohort_day = get_date_int(online, 'CohortDay')

Now we will use these integer values to calculate business metrics for our time cohorts!

Calculate time offset in days - part 2

Great work! Now we have $6$ different series with year, month and day values for the invoice and cohort dates - invoice_year, cohort_year, invoice_month, cohort_month, invoice_day, and cohort_day.

In this exercise we will calculate the difference between the invoice and cohort dates in years, months and days separately, and then combine them into a total difference in days. This will be our days offset, which we will use in the next exercise to visualize the customer count.

In [7]:
# Calculate difference in years
years_diff = invoice_year - cohort_year

# Calculate difference in months
months_diff = invoice_month - cohort_month

# Calculate difference in days
days_diff = invoice_day - cohort_day

# Combine the differences into an approximate total day offset (+1 so the acquisition day has index 1)
online['CohortIndex'] = years_diff * 365 + months_diff * 30 + days_diff + 1

online.head()
Out[7]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceDay CohortDay CohortIndex
416792 572558 22745 POPPY'S PLAYHOUSE BEDROOM 6 2011-10-25 08:26:00 2.10 14286 United Kingdom 2011-10-25 2011-04-11 195
482904 577485 23196 VINTAGE LEAF MAGNETIC NOTEPAD 1 2011-11-20 11:56:00 1.45 16360 United Kingdom 2011-11-20 2011-09-12 69
263743 560034 23299 FOOD COVER WITH BEADS SET 2 6 2011-07-14 13:35:00 3.75 13933 United Kingdom 2011-07-14 2011-07-14 1
495549 578307 72349B SET/6 PURPLE BUTTERFLY T-LIGHTS 1 2011-11-23 15:53:00 2.10 17290 United Kingdom 2011-11-23 2011-11-23 1
204384 554656 21756 BATH BUILDING BLOCK WORD 3 2011-05-25 13:36:00 5.95 17663 United Kingdom 2011-05-25 2011-02-25 91

We have successfully assigned the daily time offset to each transaction and can use it for running daily cohort analysis!
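
Note that the $365$/$30$ multipliers used above are only an approximation of calendar arithmetic (every month contributes $30$ days to the offset regardless of its actual length). Since InvoiceDay and CohortDay are already datetime columns, an exact offset could also be computed directly from their difference; a sketch, adding a hypothetical CohortIndex_exact column rather than the CohortIndex used in the output above:

# Exact day offset via timedelta arithmetic (+1 so the acquisition day has index 1)
online['CohortIndex_exact'] = (online['InvoiceDay'] - online['CohortDay']).dt.days + 1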

Cohort metrics

  • Retention: how many customers from each cohort have returned in the subsequent months? This gives you the percentage of active customers compared to the total number of customers in the cohort.

Customer retention: cohort_counts table

  • How many customers were originally in each cohort?
  • How many of them were active in the following months?

Calculate retention rate from scratch

We have seen how retention and average-quantity metrics tables for monthly acquisition cohorts are structured. Now it's time to build the retention metrics from scratch.

In [8]:
# Define a function that will parse the month from the date 
def get_month(x): 
    return dt.datetime(x.year, x.month, 1)
In [9]:
# Calculate the month offset (+1 so the acquisition month has index 1)
online['CohortIndex_M'] = years_diff * 12 + months_diff + 1

# Create InvoiceMonth column
online['InvoiceMonth'] = online['InvoiceDate'].apply(get_month)

# Group by CustomerID and select the InvoiceMonth value
grouping = online.groupby('CustomerID')['InvoiceMonth']

# Assign the minimum InvoiceMonth value per customer as their CohortMonth
online['CohortMonth'] = grouping.transform('min')

# View the top 5 rows
online.head()
Out[9]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceDay CohortDay CohortIndex CohortIndex_M InvoiceMonth CohortMonth
416792 572558 22745 POPPY'S PLAYHOUSE BEDROOM 6 2011-10-25 08:26:00 2.10 14286 United Kingdom 2011-10-25 2011-04-11 195 7 2011-10-01 2011-04-01
482904 577485 23196 VINTAGE LEAF MAGNETIC NOTEPAD 1 2011-11-20 11:56:00 1.45 16360 United Kingdom 2011-11-20 2011-09-12 69 3 2011-11-01 2011-09-01
263743 560034 23299 FOOD COVER WITH BEADS SET 2 6 2011-07-14 13:35:00 3.75 13933 United Kingdom 2011-07-14 2011-07-14 1 1 2011-07-01 2011-07-01
495549 578307 72349B SET/6 PURPLE BUTTERFLY T-LIGHTS 1 2011-11-23 15:53:00 2.10 17290 United Kingdom 2011-11-23 2011-11-23 1 1 2011-11-01 2011-11-01
204384 554656 21756 BATH BUILDING BLOCK WORD 3 2011-05-25 13:36:00 5.95 17663 United Kingdom 2011-05-25 2011-02-25 91 4 2011-05-01 2011-02-01
In [10]:
# Group by CohortMonth and CohortIndex_M
grouping = online.groupby(['CohortMonth', 'CohortIndex_M'])

# Count the number of unique customers in each group
cohort_data = grouping['CustomerID'].apply(pd.Series.nunique)

# Reset the index
cohort_data = cohort_data.reset_index()

# Create a pivot table with CohortMonth as rows, CohortIndex_M as columns, and unique customer counts as values
cohort_counts = cohort_data.pivot(index='CohortMonth', columns='CohortIndex_M', values='CustomerID')

cohort_counts
Out[10]:
CohortIndex_M 1 2 3 4 5 6 7 8 9 10 11 12 13
CohortMonth
2010-12-01 716.0 246.0 221.0 251.0 245.0 285.0 249.0 236.0 240.0 265.0 254.0 348.0 172.0
2011-01-01 332.0 69.0 82.0 81.0 110.0 90.0 82.0 86.0 104.0 102.0 124.0 45.0 NaN
2011-02-01 316.0 58.0 57.0 83.0 85.0 74.0 80.0 83.0 86.0 95.0 28.0 NaN NaN
2011-03-01 388.0 63.0 100.0 76.0 83.0 67.0 98.0 85.0 107.0 38.0 NaN NaN NaN
2011-04-01 255.0 49.0 52.0 49.0 47.0 52.0 56.0 59.0 17.0 NaN NaN NaN NaN
2011-05-01 249.0 40.0 43.0 36.0 52.0 58.0 61.0 22.0 NaN NaN NaN NaN NaN
2011-06-01 207.0 33.0 26.0 41.0 49.0 62.0 19.0 NaN NaN NaN NaN NaN NaN
2011-07-01 173.0 28.0 31.0 38.0 44.0 17.0 NaN NaN NaN NaN NaN NaN NaN
2011-08-01 139.0 30.0 28.0 35.0 14.0 NaN NaN NaN NaN NaN NaN NaN NaN
2011-09-01 279.0 56.0 78.0 34.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-10-01 318.0 67.0 30.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-11-01 291.0 32.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-01 38.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [11]:
# Select the first column (cohort index 1, i.e. the cohort sizes) and store it as cohort_sizes
cohort_sizes = cohort_counts.iloc[:,0]

# Divide the cohort count by cohort sizes along the rows
retention = cohort_counts.divide(cohort_sizes, axis=0)
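
The retention table itself is not printed above. To inspect it as percentages, or to visualize it the same way the other cohort metrics are visualized further below, something like the following sketch should work (seaborn and matplotlib are imported later in this notebook; the color map choice is arbitrary):

import matplotlib.pyplot as plt
import seaborn as sns

# Retention rates as percentages, one row per acquisition cohort
print((retention * 100).round(1))

# Heatmap of monthly retention rates
plt.figure(figsize=(12, 6))
plt.title('Retention by Monthly Cohorts')
sns.heatmap(retention, annot=True, fmt='.0%', cmap='BuGn')
plt.show()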

Calculate average price

We will now calculate the average price metric and analyze if there are any differences in shopping patterns across time and across cohorts.

In [12]:
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortMonth', 'CohortIndex_M']) 

# Calculate the average of the unit price 
cohort_data = grouping['UnitPrice'].mean()

# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()

# Create a pivot 
average_price = cohort_data.pivot(index='CohortMonth', columns='CohortIndex_M', values='UnitPrice')
print(average_price.round(1))
CohortIndex_M   1    2    3    4    5    6    7    8    9    10   11   12   13
CohortMonth                                                                   
2010-12-01     3.0  3.0  3.0  2.8  2.7  6.9  2.8  3.0  2.7  2.7  3.0  2.8  2.6
2011-01-01     3.2  3.1  3.0  3.0  3.1  3.0  3.0  2.5  2.7  2.9  2.6  2.0  NaN
2011-02-01     3.1  4.0  3.3  2.9  3.3  2.9  2.8  2.7  2.9  2.7  3.1  NaN  NaN
2011-03-01     3.5  3.6  3.5  2.8  2.7  2.5  2.7  2.9  2.5  2.4  NaN  NaN  NaN
2011-04-01     3.3  4.4  3.4  2.6  2.8  2.8  2.8  2.6  2.6  NaN  NaN  NaN  NaN
2011-05-01     3.1  2.8  2.5  2.7  2.5  2.3  2.7  2.3  NaN  NaN  NaN  NaN  NaN
2011-06-01     2.8  2.4  2.7  3.1  2.5  2.4  2.5  NaN  NaN  NaN  NaN  NaN  NaN
2011-07-01     3.2  3.1  3.4  2.5  2.4  2.3  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-08-01     2.9  3.7  5.4  6.9  4.2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-09-01     2.9  3.1  3.0  2.6  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-10-01     2.9  2.7  2.5  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-11-01     2.5  2.1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2011-12-01     1.9  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Visualize average quantity metric

We are now going to visualize average quantity values in a heatmap.

In [13]:
# Create a groupby object and pass the monthly cohort and cohort index as a list
grouping = online.groupby(['CohortMonth', 'CohortIndex_M']) 

# Calculate the average of the quantity  
cohort_data = grouping['Quantity'].mean()

# Reset the index of cohort_data
cohort_data = cohort_data.reset_index()

# Create a pivot 
average_quantity = cohort_data.pivot(index='CohortMonth', columns='CohortIndex_M', values='Quantity')
In [14]:
import matplotlib.pyplot as plt
import seaborn as sns

# Initialize a 12 by 6 inch plot figure
fig, ax = plt.subplots(figsize=(12, 6))

# Add a title
ax.set_title('Average Quantity by Monthly Cohorts')

# Create the heatmap
sns.heatmap(average_quantity, annot=True, fmt=".1f", annot_kws={"size": 12}, ax=ax)
# Work around a matplotlib version issue where the top and bottom heatmap rows can be cut off
b, t = ax.set_ylim() 
b += 0.5 
t -= 0.5 
ax.set_ylim(b, t) 
plt.show()

Recency, frequency, monetary (RFM) segmentation

What is RFM segmentation?

Behavioral customer segmentation based on three metrics:

  • Recency (R): time since last customer transaction
  • Frequency (F): number of purchases in the observed period
  • Monetary Value (M): total amount spent in the observed period

Grouping RFM values

The RFM values can be grouped in several ways:

  • Percentiles e.g. quantiles
  • Pareto $80/20$ cut
  • Custom - based on business knowledge

We are going to implement percentile-based grouping.

Short review of percentiles

Process of calculating percentiles (a toy example follows the steps):

  1. Sort customers based on the chosen metric
  2. Break customers into a pre-defined number of groups of equal size
  3. Assign a label to each group
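
As a toy illustration of these three steps with pandas (the spend values are invented for the example):

import pandas as pd

# Toy spend values for 9 customers
spend = pd.Series([5, 12, 18, 25, 40, 53, 70, 88, 95])

# qcut sorts the values, splits them into 3 equal-sized groups and
# assigns the labels 1 (lowest spend) to 3 (highest spend)
spend_group = pd.qcut(spend, q=3, labels=[1, 2, 3])
print(spend_group)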

Calculate RFM values

Calculate Recency, Frequency and Monetary values for the online dataset we have used before.

In [15]:
# Calculate the total sum 
online['TotalSum'] = online['Quantity'] * online['UnitPrice']

# Print min and max dates 
print('Min:{}; Max:{}'.format(min(online.InvoiceDate), max(online.InvoiceDate)))
Min:2010-12-01 08:26:00; Max:2011-12-09 12:49:00
In [16]:
# Create a hypothetical snapshot date, as if we were running the analysis the day after the last transaction
snapshot_date = max(online.InvoiceDate) + dt.timedelta(days=1)
print(snapshot_date)
2011-12-10 12:49:00
In [17]:
# Calculate Recency, Frequency and Monetary value for each customer 
datamart = online.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'count',
    'TotalSum': 'sum'})

# Rename the columns 
datamart.rename(columns={'InvoiceDate': 'Recency',
                         'InvoiceNo': 'Frequency',
                         'TotalSum': 'MonetaryValue'}, inplace=True)
# Print top 5 rows
datamart.head()
Out[17]:
Recency Frequency MonetaryValue
CustomerID
12747 2 27 992.82
12748 1 967 7522.06
12749 4 37 813.45
12820 3 17 268.02
12822 71 9 146.15

Calculate 3 groups for recency and frequency

We will now group the customers into three separate groups based on Recency and Frequency.

We will use this result in the next exercise, where we will group customers based on MonetaryValue and finally calculate an RFM_Score.

In [18]:
# Create labels for Recency and Frequency
# (Recency labels are reversed: the most recently active customers get the highest R score)
r_labels = range(3, 0, -1); f_labels = range(1, 4)

# Assign these labels to three equal percentile groups 
r_groups = pd.qcut(datamart['Recency'], q=3, labels=r_labels)

# Assign these labels to three equal percentile groups 
f_groups = pd.qcut(datamart['Frequency'], q=3, labels=f_labels)

# Create new columns R and F
datamart = datamart.assign(R=r_groups.values, F=f_groups.values)
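
To verify that the reversed Recency labels behave as intended (more recent customers, i.e. smaller Recency values, receiving a higher R score), one could inspect the Recency range covered by each group, for example:

# Recency range (in days) covered by each R score
print(datamart.groupby('R')['Recency'].agg(['min', 'max', 'count']))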

Calculate RFM Score

We will now finish the job by assigning customers to three groups based on the MonetaryValue percentiles and then calculate an RFM_Score which is a sum of the $R$, $F$, and $M$ values.

In [19]:
# Create labels for MonetaryValue 
m_labels = range(1, 4)

# Assign these labels to three equal percentile groups
m_groups = pd.qcut(datamart['MonetaryValue'], q=3, labels=m_labels)

# Create new column M
datamart = datamart.assign(M=m_groups.values)

# Calculate RFM_Score
datamart['RFM_Score'] = datamart[['R','F','M']].sum(axis=1)

datamart.head()
Out[19]:
Recency Frequency MonetaryValue R F M RFM_Score
CustomerID
12747 2 27 992.82 3 3 3 9.0
12748 1 967 7522.06 3 3 3 9.0
12749 4 37 813.45 3 3 3 9.0
12820 3 17 268.02 3 3 3 9.0
12822 71 9 146.15 2 2 2 6.0

Creating custom segments

Here we will create a custom segmentation based on RFM_Score values. We will write a function that maps each RFM_Score to a segment label and then assign that label to each customer.

In [20]:
# Define rfm_level function
def rfm_level(df):
    if df['RFM_Score'] >= 9:
        return 'Top'
    elif ((df['RFM_Score'] >= 5) and (df['RFM_Score'] < 9)):
        return 'Middle'
    else:
        return 'Low'

# Create a new variable RFM_Level
datamart['RFM_Level'] = datamart.apply(rfm_level, axis=1)

# Print the header with the top 5 rows to the console.
datamart.head()
Out[20]:
Recency Frequency MonetaryValue R F M RFM_Score RFM_Level
CustomerID
12747 2 27 992.82 3 3 3 9.0 Top
12748 1 967 7522.06 3 3 3 9.0 Top
12749 4 37 813.45 3 3 3 9.0 Top
12820 3 17 268.02 3 3 3 9.0 Top
12822 71 9 146.15 2 2 2 6.0 Middle

Analyzing custom segments

As a final step, we will analyze average values of Recency, Frequency and MonetaryValue for the custom segments we've created.

In [21]:
# Calculate average values for each RFM_Level, and return the size of each segment 
rfm_level_agg = datamart.groupby('RFM_Level').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
  
    # Return the size of each segment
    'MonetaryValue': ['mean', 'count']}).round(1)

# Print the aggregated dataset
rfm_level_agg
Out[21]:
Recency Frequency MonetaryValue
mean mean mean count
RFM_Level
Low 186.3 3.3 50.8 1054
Middle 70.5 15.1 286.0 2081
Top 10.3 63.7 1342.1 566

Data pre-processing for k-means clustering

Advantages of k-means clustering

  • One of the most popular unsupervised learning methods
  • Simple and fast
  • Works well, given certain assumptions about the data

Key k-means assumptions

  • Symmetric distribution of variables (not skewed)
  • Variables with same average values
  • Variables with same variance

Variables on the same scale

  • K-means assumes equal mean
  • And equal variance
  • It's not the case with RFM data - a quick check follows below
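
A quick way to confirm this on our data is to compare the summary statistics of the raw RFM values (a sketch reusing the datamart table built in the RFM section above):

# Means, standard deviations and skewness differ a lot across the three metrics
print(datamart[['Recency', 'Frequency', 'MonetaryValue']].agg(['mean', 'std', 'skew']).round(1))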

Visualize RFM distributions

We will now explore the RFM distributions.

In [22]:
datamart_rfm = datamart[['Recency', 'Frequency', 'MonetaryValue']]

# Plot recency distribution
plt.subplot(3, 1, 1); sns.distplot(datamart_rfm['Recency'])

# Plot frequency distribution
plt.subplot(3, 1, 2); sns.distplot(datamart_rfm['Frequency'])

# Plot monetary value distribution
plt.subplot(3, 1, 3); sns.distplot(datamart_rfm['MonetaryValue'])

# Show the plot
plt.tight_layout()
plt.show()
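
Note that sns.distplot has been deprecated in newer seaborn releases (0.11 and later). If it is unavailable, an equivalent set of plots can be drawn with histplot; a sketch under that assumption:

# Equivalent distribution plots with the newer seaborn API
fig, axes = plt.subplots(3, 1, figsize=(8, 9))
sns.histplot(datamart_rfm['Recency'], kde=True, ax=axes[0])
sns.histplot(datamart_rfm['Frequency'], kde=True, ax=axes[1])
sns.histplot(datamart_rfm['MonetaryValue'], kde=True, ax=axes[2])
plt.tight_layout()
plt.show()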

Pre-process RFM data

Since the variables are skewed and are on different scales, we will now un-skew and scale them.

In [23]:
from sklearn.preprocessing import StandardScaler

# Unskew the data
datamart_log = np.log(datamart_rfm)

# Initialize a standard scaler and fit it
scaler = StandardScaler()
scaler.fit(datamart_log)

# Scale and center the data
datamart_normalized = scaler.transform(datamart_log)

# Create a pandas DataFrame
datamart_normalized = pd.DataFrame(data=datamart_normalized, index=datamart_rfm.index, 
                                   columns=datamart_rfm.columns)
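
One caveat: np.log is only defined for strictly positive values. The log transform runs without issue on this sample, but if the data contained zero or negative values (e.g. customers whose refunds outweigh their purchases, driving MonetaryValue below zero), they would need to be handled first. A hypothetical guard (the datamart_rfm_pos name is illustrative) could look like this:

# Count values that would break np.log
print((datamart_rfm <= 0).sum())

# One possible guard: keep only customers with strictly positive R, F and M values
datamart_rfm_pos = datamart_rfm[(datamart_rfm > 0).all(axis=1)]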

Visualize the normalized variables

Now we will plot the normalized and unskewed variables to see the difference in the distribution as well as the range of the values.

In [24]:
# Plot recency distribution
plt.subplot(3, 1, 1); sns.distplot(datamart_normalized['Recency'])

# Plot frequency distribution
plt.subplot(3, 1, 2); sns.distplot(datamart_normalized['Frequency'])

# Plot monetary value distribution
plt.subplot(3, 1, 3); sns.distplot(datamart_normalized['MonetaryValue'])

# Show the plot
plt.tight_layout()
plt.show()

Run k-means

We will now build a $3$-cluster k-means model.

In [25]:
# Import KMeans 
from sklearn.cluster import KMeans

# Initialize KMeans
kmeans = KMeans(n_clusters=3, random_state=1) 

# Fit k-means clustering on the normalized data set
kmeans.fit(datamart_normalized)

# Extract cluster labels
cluster_labels = kmeans.labels_

Assign labels to raw data

We will now analyze the average RFM values of the three clusters we've created in the previous exercise.

In [26]:
# Create a DataFrame by adding a new cluster label column
datamart_rfm_k3 = datamart_rfm.assign(Cluster=cluster_labels)

# Group the data by cluster
grouped = datamart_rfm_k3.groupby(['Cluster'])

# Calculate average RFM values and segment sizes per cluster value
grouped.agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'MonetaryValue': ['mean', 'count']
  }).round(1)
Out[26]:
Recency Frequency MonetaryValue
mean mean mean count
Cluster
0 13.8 52.7 1128.1 855
1 171.7 3.1 55.3 1224
2 78.3 13.6 231.9 1622

You can immediately see the differences in RFM values of these segments!

Choosing number of clusters

Methods

  • Visual methods - elbow criterion
  • Mathematical methods - silhouette coefficient
  • Experimentation and interpretation

Elbow criterion method

  • Plot the number of clusters against within-cluster sum-of-squared-errors (SSE) - sum of squared distances from every data point to their cluster center
  • Identify an "elbow" in the plot
  • Elbow - a point representing an "optimal" number of clusters

Experimental approach - analyze segments

  • Build clustering at and around elbow solution
  • Analyze their properties - average RFM values
  • Compare against each other and choose one which makes most business sense

Calculate sum of squared errors

In this exercise, we will calculate the sum of squared errors for different numbers of clusters, ranging from $1$ to $20$.

In [27]:
sse = dict()

# Fit KMeans and calculate SSE for each k
for k in range(1, 21):
  
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    
    # Fit KMeans on the normalized dataset
    kmeans.fit(datamart_normalized)
    
    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_ 

Plot sum of squared errors

Now we will plot the sum of squared errors for each value of $k$ and identify if there is an elbow. This will guide us towards the recommended number of clusters to use.

In [28]:
# Add the plot title "The Elbow Method"
plt.title('The Elbow Method')

# Add X-axis label "k"
plt.xlabel('k')

# Add Y-axis label "SSE"
plt.ylabel('SSE')

# Plot SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

You can see the elbow is clearly around $2-3$ clusters!

Compute the mean silhouette coefficient of all samples

The Silhouette Coefficient is calculated using the mean intra-cluster distance $(a)$ and the mean nearest-cluster distance $(b)$ for each sample. The Silhouette Coefficient for a sample is

$$s = \frac{b - a}{max(a, b)}$$

To clarify, $b$ is the distance between a sample and the nearest cluster that the sample is not a part of, while $a$ is the mean distance between a sample and all other points in the same cluster. Note that the silhouette coefficient is only defined if the number of labels satisfies $2 \leq$ n_labels $\leq$ n_samples $- 1$.

The best value is $1$ and the worst value is $-1$. Values near $0$ indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

Now we will compute the silhouette scores for several values of $k$ and identify which value gives the highest average score. This will guide us towards the recommended number of clusters to use.
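
Before the full visualization below, note that the average silhouette score for a single solution can be computed in a couple of lines; a minimal sketch, reusing the normalized data and the $3$-cluster labels fitted earlier:

from sklearn.metrics import silhouette_score

# Average silhouette score for the 3-cluster k-means solution
score_k3 = silhouette_score(datamart_normalized, cluster_labels)
print(round(score_k3, 3))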

In [29]:
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

X = datamart_normalized.values

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')
    
    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o', c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()
For n_clusters = 2 The average silhouette_score is : 0.3898282199798433
For n_clusters = 3 The average silhouette_score is : 0.3067028493682284
For n_clusters = 4 The average silhouette_score is : 0.2957223549730578
For n_clusters = 5 The average silhouette_score is : 0.2811533972803104
For n_clusters = 6 The average silhouette_score is : 0.2783496283019491

Profile and interpret segments

Approaches to build customer personas

  • Summary statistics for each cluster e.g. average RFM values
  • Snake plots (from market research)
  • Relative importance of cluster attributes compared to population

Snake plots to understand and compare segments

  • Market research technique to compare different segments
  • Visual representation of each segment's attributes
  • Need to first normalize data (center & scale)
  • Plot each cluster's average normalized values of each attribute

Prepare data for the snake plot

Now we will prepare data for the snake plot. We will use the 3-cluster RFM segmentation solution we have built previously. We will transform the normalized RFM data into a long format by "melting" the metric columns into two columns - one for the name of the metric, and another for the actual numeric value.

In [30]:
datamart_normalized['Cluster'] = datamart_rfm_k3['Cluster']

# Melt the normalized dataset and reset the index
datamart_melt = pd.melt(datamart_normalized.reset_index(), 
                        
                    # Assign CustomerID and Cluster as ID variables                  
                    id_vars=['CustomerID', 'Cluster'],

                    # Assign RFM values as value variables
                    value_vars=['Recency', 'Frequency', 'MonetaryValue'], 
                        
                    # Name the variable and value
                    var_name='Metric', value_name='Value')

datamart_melt.head()
Out[30]:
CustomerID Cluster Metric Value
0 12747 0 Recency -2.195642
1 12748 0 Recency -2.684553
2 12749 0 Recency -1.706731
3 12820 0 Recency -1.909647
4 12822 2 Recency 0.322128

Visualize snake plot

Now we will use the melted dataset to build the snake plot.

In [31]:
# Add the plot title
plt.title('Snake plot of normalized variables')

# Add the x axis label
plt.xlabel('Metric')

# Add the y axis label
plt.ylabel('Value')

# Plot a line for each value of the cluster variable
sns.lineplot(data=datamart_melt, x='Metric', y='Value', hue='Cluster')
plt.show()

Calculate relative importance of each attribute

Relative importance of segment attributes

  • Useful technique to identify the relative importance of each segment's attributes
  • Calculate average values of each cluster
  • Calculate average values of the population
  • Calculate an importance score by dividing them and subtracting $1$ (this ensures $0$ is returned when the cluster average equals the population average; e.g. a cluster average twice the population average gives $2/1 - 1 = 1$)

Now we will calculate the relative importance of the RFM values within each cluster.

In [32]:
# Calculate average RFM values for each cluster
cluster_avg = datamart_rfm_k3.groupby(['Cluster']).mean() 

# Calculate average RFM values for the total customer population
population_avg = datamart_rfm.mean()

# Calculate relative importance of cluster's attribute value compared to population
relative_imp = cluster_avg / population_avg - 1

# Print relative importance score rounded to 2 decimals
relative_imp.round(2)
Out[32]:
Recency Frequency MonetaryValue
Cluster
0 -0.85 1.75 1.96
1 0.82 -0.84 -0.85
2 -0.17 -0.29 -0.39

The further the score is from $0$, the more important that attribute is for the segment relative to the total population.

Plot relative importance heatmap

Now we will build a heatmap visualizing the relative scores for each cluster.

In [33]:
# Initialize a plot with a figure size of 8 by 2 inches 
plt.figure(figsize=(8, 2))

# Add the plot title
plt.title('Relative importance of attributes')

# Plot the heatmap
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn', annot_kws={"size": 16})
# Same workaround for the heatmap row-clipping issue as above
b, t = plt.ylim() 
b += 0.5 
t -= 0.5 
plt.ylim(b, t) 
plt.show()