Kar Ng 2022-05
- 1 SUMMARY
- 2 R PACKAGES
- 3 INTRODUCTION
- 4 DATA PREPARATION
- 5 DATA CLEANING AND MANIPULATION
- 6 EDA
- 7 CLUSTERING
- 8 CONCLUSION
- 9 REFERENCE
This project clusters a large credit card dataset with 17 variables and about 9,000 rows. Prior to clustering, the dataset was cleaned, feature-selected using a correlation-based assessment of multicollinearity, and transformed via standardisation.
Principal component analysis (PCA) was used for exploratory data analysis (EDA). It found that the amount of cash-advance transactions, credit limit, magnitude of payments made back to the credit card company, magnitude of purchases, and the percentage of full payments made by card users are the important variables (features) in this analysis. Other variables provide useful supporting information, such as the preference for making minimum payments and the frequency of one-off versus installment purchases.
Five clustering algorithms are applied: Clustering Large Applications (CLARA), hierarchical k-means clustering (hkmeans), fuzzy clustering, model-based clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The analysis shows that the VEV model suggested by model-based clustering is the best clustering model: it detects 8 distinct groups of credit card users, the largest number suggested by any of the algorithms. These groups are:
Cluster 1: Less active users who prefer cash in advance.
Cluster 2: Less active users who prefer to use the credit card to make purchases, especially installment purchases.
Cluster 3: Revolvers who prefer to make expensive purchases.
Cluster 4: Less active users who prefer to repay the credit card company in full.
Cluster 5: Active card users who make expensive purchases and large repayments; sometimes they pay in full, sometimes only the minimum.
Cluster 6: Max payers who prefer to pay what they owe in full, with zero tenure.
Cluster 7: Less active revolvers who spend small amounts on cheaper products.
Cluster 8: Max payers, but this group is more active in making purchases than cluster 6.
Highlight
The following R packages are loaded for this project.
set.seed(123)
library(tidyverse)
library(kableExtra)
library(skimr)
library(factoextra)
library(hopkins)
library(clValid)
library(NbClust)
library(mclust)
library(cowplot)
library(dbscan)
library(fpc)
library(corrplot)
library(FactoMineR)
library(ggiraphExtra) # for radar
library(GGally) # for ggparcoord
library(caret) # findCorrelation
library(ggrepel)
This project uses various clustering techniques to develop a customer segmentation that can inform marketing strategy.
Dataset used in this project is called “Credit Card Dataset for Clustering” by Arjun Bhasin. It is a public dataset from Kaggle.com.
The dataset contains 17 behavioural variables for about 9,000 active credit card holders.
The following code imports the dataset and specifies the first column as row names; row names are required so that each observation can be identified during clustering.
cc <- read.csv("cc_dataset.csv",
header = T,
row.names = 1) # specify column 1 as row names
Randomly sample 10 rows of the dataset.
sample_n(cc, 10) %>%
kbl() %>%
kable_material_dark()
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C12536 | 1417.9792 | 1.000000 | 0.00 | 0.00 | 0.00 | 97.10606 | 0.000000 | 0.000000 | 0.000000 | 0.285714 | 4 | 0 | 1500 | 201.4252 | 233.5456 | 0.000000 | 7 |
C12584 | 147.4186 | 0.181818 | 1074.00 | 1074.00 | 0.00 | 0.00000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 2 | 5500 | 0.0000 | NA | 0.000000 | 12 |
C18954 | 220.9437 | 1.000000 | 1833.34 | 0.00 | 1833.34 | 0.00000 | 0.916667 | 0.000000 | 0.916667 | 0.000000 | 0 | 11 | 1000 | 1652.7554 | 176.8253 | 0.818182 | 12 |
C13073 | 1820.2166 | 1.000000 | 4877.66 | 2341.16 | 2536.50 | 0.00000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0 | 154 | 4500 | 1297.9945 | 408.2934 | 0.166667 | 12 |
C11902 | 5941.2712 | 1.000000 | 279.36 | 0.00 | 279.36 | 905.98283 | 0.666667 | 0.000000 | 0.666667 | 0.083333 | 2 | 16 | 6000 | 3140.4024 | 2949.0011 | 0.000000 | 12 |
C13466 | 820.1633 | 1.000000 | 680.14 | 0.00 | 680.14 | 0.00000 | 0.750000 | 0.000000 | 0.666667 | 0.000000 | 0 | 18 | 1200 | 1178.2472 | 219.8759 | 0.000000 | 12 |
C14894 | 3755.1840 | 1.000000 | 3211.86 | 1420.58 | 1791.28 | 0.00000 | 1.000000 | 0.750000 | 0.833333 | 0.000000 | 0 | 22 | 11000 | 3198.2619 | 960.8566 | 0.000000 | 12 |
C16933 | 156.7233 | 1.000000 | 1776.76 | 0.00 | 1776.76 | 0.00000 | 1.000000 | 0.000000 | 0.916667 | 0.000000 | 0 | 22 | 9600 | 1339.3031 | 185.2810 | 0.916667 | 12 |
C12835 | 3057.5537 | 1.000000 | 0.00 | 0.00 | 0.00 | 2037.64807 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 3 | 0 | 12500 | 7265.9761 | 847.1706 | 0.000000 | 12 |
C15251 | 1852.6464 | 1.000000 | 2600.00 | 2200.00 | 400.00 | 3481.13268 | 0.333333 | 0.166667 | 0.083333 | 0.083333 | 8 | 4 | 6000 | 1362.4258 | 727.4295 | 0.100000 | 12 |
The names of all variables are:
names(cc)
## [1] "BALANCE" "BALANCE_FREQUENCY"
## [3] "PURCHASES" "ONEOFF_PURCHASES"
## [5] "INSTALLMENTS_PURCHASES" "CASH_ADVANCE"
## [7] "PURCHASES_FREQUENCY" "ONEOFF_PURCHASES_FREQUENCY"
## [9] "PURCHASES_INSTALLMENTS_FREQUENCY" "CASH_ADVANCE_FREQUENCY"
## [11] "CASH_ADVANCE_TRX" "PURCHASES_TRX"
## [13] "CREDIT_LIMIT" "PAYMENTS"
## [15] "MINIMUM_PAYMENTS" "PRC_FULL_PAYMENT"
## [17] "TENURE"
The following data description is extracted from the relevant Kaggle page.
Variables <- c("CUSTID", "BALANCE", "BALANCEFREQUENCY", "PURCHASES", "ONEOFFPURCHASES",
"INSTALLMENTSPURCHASES", "CASHADVANCE", "PURCHASESFREQUENCY", "ONEOFFPURCHASESFREQUENCY", "PURCHASESINSTALLMENTSFREQUENCY", "CASHADVANCEFREQUENCY", "CASHADVANCETRX", "PURCHASESTRX", "CREDITLIMIT", "PAYMENTS", "MINIMUM_PAYMENTS", "PRCFULLPAYMENT", "TENURE")
Description <- c("Identification of Credit Card holder (Categorical)",
"Balance amount left in their account to make purchases",
"How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)",
"Amount of purchases made from account",
"Maximum purchase amount done in one-go",
"Amount of purchase done in installment",
"Cash in advance given by the user",
"How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)",
"How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)",
"How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)",
"How frequently the cash in advance being paid",
"Number of Transactions made with Cash in Advanced",
"Number of purchase transactions made",
"Limit of Credit Card for user",
"Amount of Payment done by user",
"Minimum amount of payments made by user",
"Percent of full payment paid by user",
"Tenure of credit card service for user")
data.frame(Variables, Description) %>%
kbl() %>%
kable_material_dark()
Variables | Description |
---|---|
CUSTID | Identification of Credit Card holder (Categorical) |
BALANCE | Balance amount left in their account to make purchases |
BALANCEFREQUENCY | How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated) |
PURCHASES | Amount of purchases made from account |
ONEOFFPURCHASES | Maximum purchase amount done in one-go |
INSTALLMENTSPURCHASES | Amount of purchase done in installment |
CASHADVANCE | Cash in advance given by the user |
PURCHASESFREQUENCY | How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased) |
ONEOFFPURCHASESFREQUENCY | How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased) |
PURCHASESINSTALLMENTSFREQUENCY | How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done) |
CASHADVANCEFREQUENCY | How frequently the cash in advance being paid |
CASHADVANCETRX | Number of Transactions made with Cash in Advanced |
PURCHASESTRX | Number of purchase transactions made |
CREDITLIMIT | Limit of Credit Card for user |
PAYMENTS | Amount of Payment done by user |
MINIMUM_PAYMENTS | Minimum amount of payments made by user |
PRCFULLPAYMENT | Percent of full payment paid by user |
TENURE | Tenure of credit card service for user |
Data Size and type
The dataset has 8950 rows and 17 variables. The following summary displays the data type of each column and its initial values.
glimpse(cc)
## Rows: 8,950
## Columns: 17
## $ BALANCE <dbl> 40.90075, 3202.46742, 2495.14886, 166~
## $ BALANCE_FREQUENCY <dbl> 0.818182, 0.909091, 1.000000, 0.63636~
## $ PURCHASES <dbl> 95.40, 0.00, 773.17, 1499.00, 16.00, ~
## $ ONEOFF_PURCHASES <dbl> 0.00, 0.00, 773.17, 1499.00, 16.00, 0~
## $ INSTALLMENTS_PURCHASES <dbl> 95.40, 0.00, 0.00, 0.00, 0.00, 1333.2~
## $ CASH_ADVANCE <dbl> 0.0000, 6442.9455, 0.0000, 205.7880, ~
## $ PURCHASES_FREQUENCY <dbl> 0.166667, 0.000000, 1.000000, 0.08333~
## $ ONEOFF_PURCHASES_FREQUENCY <dbl> 0.000000, 0.000000, 1.000000, 0.08333~
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000~
## $ CASH_ADVANCE_FREQUENCY <dbl> 0.000000, 0.250000, 0.000000, 0.08333~
## $ CASH_ADVANCE_TRX <int> 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ PURCHASES_TRX <int> 2, 0, 12, 1, 1, 8, 64, 12, 5, 3, 12, ~
## $ CREDIT_LIMIT <dbl> 1000, 7000, 7500, 7500, 1200, 1800, 1~
## $ PAYMENTS <dbl> 201.8021, 4103.0326, 622.0667, 0.0000~
## $ MINIMUM_PAYMENTS <dbl> 139.50979, 1072.34022, 627.28479, NA,~
## $ PRC_FULL_PAYMENT <dbl> 0.000000, 0.222222, 0.000000, 0.00000~
## $ TENURE <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1~
The “dbl” and “int” labels are data types assigned by R to columns with those characteristics. “dbl” stands for “double” and is used for numeric variables with decimal places; “int” stands for “integer” and is used for whole-number variables.
From the above summary, I can see that all variables are numeric, typed as either “dbl” or “int”.
Purchases
I found that the variable “PURCHASES” is the sum of “ONEOFF_PURCHASES” and “INSTALLMENTS_PURCHASES”. This finding may not matter much for the analysis itself; feature selection is carried out in the next section.
The following code selects the 10 rows with the highest BALANCE and creates a temporary variable “MY_PURCHASES” (used only in this section). Its values match “PURCHASES” exactly, confirming the finding.
cc %>%
top_n(10, BALANCE) %>%
dplyr::select(ONEOFF_PURCHASES, INSTALLMENTS_PURCHASES, PURCHASES) %>%
mutate(MY_PURCHASES = ONEOFF_PURCHASES + INSTALLMENTS_PURCHASES) %>%
kbl() %>%
kable_material_dark()
ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | PURCHASES | MY_PURCHASES | |
---|---|---|---|---|
C10144 | 9449.07 | 12560.85 | 22009.92 | 22009.92 |
C10544 | 529.30 | 0.00 | 529.30 | 529.30 |
C10609 | 7564.81 | 258.93 | 7823.74 | 7823.74 |
C10914 | 0.00 | 0.00 | 0.00 | 0.00 |
C12434 | 0.00 | 1168.75 | 1168.75 | 1168.75 |
C14256 | 3657.30 | 1630.98 | 5288.28 | 5288.28 |
C14836 | 717.24 | 0.00 | 717.24 | 717.24 |
C15429 | 105.30 | 579.44 | 684.74 | 684.74 |
C15642 | 0.00 | 1770.57 | 1770.57 | 1770.57 |
C16812 | 3582.45 | 1442.23 | 5024.68 | 5024.68 |
Missing values check
By examining the columns “n_missing” and “complete_rate” in the following tables, there are:
- 1 missing value in the variable “CREDIT_LIMIT”
- 313 missing values in the “MINIMUM_PAYMENTS”
skim_without_charts(cc)
Name | cc |
Number of rows | 8950 |
Number of columns | 17 |
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ | |
Column type frequency: | |
numeric | 17 |
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
BALANCE | 0 | 1.00 | 1564.47 | 2081.53 | 0.00 | 128.28 | 873.39 | 2054.14 | 19043.14 |
BALANCE_FREQUENCY | 0 | 1.00 | 0.88 | 0.24 | 0.00 | 0.89 | 1.00 | 1.00 | 1.00 |
PURCHASES | 0 | 1.00 | 1003.20 | 2136.63 | 0.00 | 39.63 | 361.28 | 1110.13 | 49039.57 |
ONEOFF_PURCHASES | 0 | 1.00 | 592.44 | 1659.89 | 0.00 | 0.00 | 38.00 | 577.41 | 40761.25 |
INSTALLMENTS_PURCHASES | 0 | 1.00 | 411.07 | 904.34 | 0.00 | 0.00 | 89.00 | 468.64 | 22500.00 |
CASH_ADVANCE | 0 | 1.00 | 978.87 | 2097.16 | 0.00 | 0.00 | 0.00 | 1113.82 | 47137.21 |
PURCHASES_FREQUENCY | 0 | 1.00 | 0.49 | 0.40 | 0.00 | 0.08 | 0.50 | 0.92 | 1.00 |
ONEOFF_PURCHASES_FREQUENCY | 0 | 1.00 | 0.20 | 0.30 | 0.00 | 0.00 | 0.08 | 0.30 | 1.00 |
PURCHASES_INSTALLMENTS_FREQUENCY | 0 | 1.00 | 0.36 | 0.40 | 0.00 | 0.00 | 0.17 | 0.75 | 1.00 |
CASH_ADVANCE_FREQUENCY | 0 | 1.00 | 0.14 | 0.20 | 0.00 | 0.00 | 0.00 | 0.22 | 1.50 |
CASH_ADVANCE_TRX | 0 | 1.00 | 3.25 | 6.82 | 0.00 | 0.00 | 0.00 | 4.00 | 123.00 |
PURCHASES_TRX | 0 | 1.00 | 14.71 | 24.86 | 0.00 | 1.00 | 7.00 | 17.00 | 358.00 |
CREDIT_LIMIT | 1 | 1.00 | 4494.45 | 3638.82 | 50.00 | 1600.00 | 3000.00 | 6500.00 | 30000.00 |
PAYMENTS | 0 | 1.00 | 1733.14 | 2895.06 | 0.00 | 383.28 | 856.90 | 1901.13 | 50721.48 |
MINIMUM_PAYMENTS | 313 | 0.97 | 864.21 | 2372.45 | 0.02 | 169.12 | 312.34 | 825.49 | 76406.21 |
PRC_FULL_PAYMENT | 0 | 1.00 | 0.15 | 0.29 | 0.00 | 0.00 | 0.00 | 0.14 | 1.00 |
TENURE | 0 | 1.00 | 11.52 | 1.34 | 6.00 | 12.00 | 12.00 | 12.00 | 12.00 |
Alternatively, the following code performs the missing-value check.
colSums(is.na(cc)) %>%
kbl(col.names = "Numbers of Missing Values") %>%
kable_material_dark(full_width = F)
Numbers of Missing Values | |
---|---|
BALANCE | 0 |
BALANCE_FREQUENCY | 0 |
PURCHASES | 0 |
ONEOFF_PURCHASES | 0 |
INSTALLMENTS_PURCHASES | 0 |
CASH_ADVANCE | 0 |
PURCHASES_FREQUENCY | 0 |
ONEOFF_PURCHASES_FREQUENCY | 0 |
PURCHASES_INSTALLMENTS_FREQUENCY | 0 |
CASH_ADVANCE_FREQUENCY | 0 |
CASH_ADVANCE_TRX | 0 |
PURCHASES_TRX | 0 |
CREDIT_LIMIT | 1 |
PAYMENTS | 0 |
MINIMUM_PAYMENTS | 313 |
PRC_FULL_PAYMENT | 0 |
TENURE | 0 |
Typically, missing values can be handled by removal, by replacement with the mean or median, or by an imputation algorithm such as KNN or bagging. Imputation is usually worthwhile when important variables contain too many missing values to discard, for example when more than 5% but less than 60% of a variable is missing.
Here, the dataset has 8950 rows and 96.5% of the data will remain after removing the rows with missing values, so for the simplicity of this project I will simply remove them.
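As an illustration of the replacement approach (not used in this project), here is a minimal base-R sketch of median imputation on a hypothetical vector:

```r
# Median imputation sketch on hypothetical payment values (NOT the
# approach taken in this project, which removes the NA rows instead)
x <- c(139.51, 1072.34, NA, 627.28, NA)
x[is.na(x)] <- median(x, na.rm = TRUE)  # NAs replaced by the median, 627.28
x
```

For a full dataset, packages such as caret offer the same idea at scale (e.g. `preProcess()` with a median-imputation method), but for this project simple removal suffices.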
Summary
Displaying the summary statistics of each variable:
summary(cc) %>% kbl() %>% kable_material_dark()
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 0.0 | Min. :0.0000 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. : 0.000 | Min. : 0.00 | Min. : 50 | Min. : 0.0 | Min. : 0.02 | Min. :0.0000 | Min. : 6.00 | |
1st Qu.: 128.3 | 1st Qu.:0.8889 | 1st Qu.: 39.63 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.: 0.0 | 1st Qu.:0.08333 | 1st Qu.:0.00000 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | 1st Qu.: 0.000 | 1st Qu.: 1.00 | 1st Qu.: 1600 | 1st Qu.: 383.3 | 1st Qu.: 169.12 | 1st Qu.:0.0000 | 1st Qu.:12.00 | |
Median : 873.4 | Median :1.0000 | Median : 361.28 | Median : 38.0 | Median : 89.0 | Median : 0.0 | Median :0.50000 | Median :0.08333 | Median :0.1667 | Median :0.0000 | Median : 0.000 | Median : 7.00 | Median : 3000 | Median : 856.9 | Median : 312.34 | Median :0.0000 | Median :12.00 | |
Mean : 1564.5 | Mean :0.8773 | Mean : 1003.20 | Mean : 592.4 | Mean : 411.1 | Mean : 978.9 | Mean :0.49035 | Mean :0.20246 | Mean :0.3644 | Mean :0.1351 | Mean : 3.249 | Mean : 14.71 | Mean : 4494 | Mean : 1733.1 | Mean : 864.21 | Mean :0.1537 | Mean :11.52 | |
3rd Qu.: 2054.1 | 3rd Qu.:1.0000 | 3rd Qu.: 1110.13 | 3rd Qu.: 577.4 | 3rd Qu.: 468.6 | 3rd Qu.: 1113.8 | 3rd Qu.:0.91667 | 3rd Qu.:0.30000 | 3rd Qu.:0.7500 | 3rd Qu.:0.2222 | 3rd Qu.: 4.000 | 3rd Qu.: 17.00 | 3rd Qu.: 6500 | 3rd Qu.: 1901.1 | 3rd Qu.: 825.49 | 3rd Qu.:0.1429 | 3rd Qu.:12.00 | |
Max. :19043.1 | Max. :1.0000 | Max. :49039.57 | Max. :40761.2 | Max. :22500.0 | Max. :47137.2 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.5000 | Max. :123.000 | Max. :358.00 | Max. :30000 | Max. :50721.5 | Max. :76406.21 | Max. :1.0000 | Max. :12.00 | |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :1 | NA | NA’s :313 | NA | NA |
The names of all variables are in upper case, which makes them difficult to read for readers and for myself.
colnames(cc) %>%
kbl(col.names = "Variables") %>%
kable_material_dark()
Variables |
---|
BALANCE |
BALANCE_FREQUENCY |
PURCHASES |
ONEOFF_PURCHASES |
INSTALLMENTS_PURCHASES |
CASH_ADVANCE |
PURCHASES_FREQUENCY |
ONEOFF_PURCHASES_FREQUENCY |
PURCHASES_INSTALLMENTS_FREQUENCY |
CASH_ADVANCE_FREQUENCY |
CASH_ADVANCE_TRX |
PURCHASES_TRX |
CREDIT_LIMIT |
PAYMENTS |
MINIMUM_PAYMENTS |
PRC_FULL_PAYMENT |
TENURE |
The following code transforms all the names into a reader-friendly sentence case.
cc <- cc %>%
rename_all(str_to_sentence)
Checking the variable names again:
colnames(cc) %>%
kbl(col.names = "Variables") %>%
kable_material_dark()
Variables |
---|
Balance |
Balance_frequency |
Purchases |
Oneoff_purchases |
Installments_purchases |
Cash_advance |
Purchases_frequency |
Oneoff_purchases_frequency |
Purchases_installments_frequency |
Cash_advance_frequency |
Cash_advance_trx |
Purchases_trx |
Credit_limit |
Payments |
Minimum_payments |
Prc_full_payment |
Tenure |
The following code removes all rows containing missing values (314 of the 8950 rows).
cc <- cc %>%
na.omit()
Now, the number of rows has been reduced from 8950 to 8636.
nrow(cc)
## [1] 8636
During exploratory data analysis (EDA), trends in the dataset are explored and the data is transformed where required.
A histogram is plotted to investigate the distribution of the data.
# data frame
df5.1 <- cc %>%
pivot_longer(c(1:17),
names_to = "my.variable",
values_to = "my.value")
# graphs
ggplot(df5.1, aes(x = my.value, fill = my.variable)) +
geom_histogram(color = "black") +
facet_wrap(~my.variable, scale = "free") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45),
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, face = "bold")) +
labs(title = "Variable Distribution Analysis",
subtitle = "by Histogram")
Many columns are skewed and contain large values. Usually one would log-transform such variables; however, since I will standardise them in a later section, I will not log-transform them here.
The following plot shows the relationships between variables. Correlated variables mean redundancy, so I will remove one variable from any pair whose correlation exceeds a chosen threshold; I will use 0.8 as the cut-off.
Graphically:
cor_cc <- cor(cc, method = "spearman")
corrplot(cor_cc, type = "lower",
method = "number",
tl.cex = 0.6,
tl.col = "purple",
number.cex = 0.6,
pch = 20)
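The correlations above use the Spearman method; a toy base-R example with hypothetical values shows why a rank-based measure is a reasonable choice for skewed variables like these:

```r
# Toy illustration (hypothetical values): Spearman correlation is
# rank-based, so one extreme value does not distort the measure of
# a monotone relationship
x <- c(1, 2, 3, 4, 100)
y <- c(2, 4, 6, 8, 1000)
cor(x, y, method = "spearman")  # 1: a perfect monotone relationship
cor(x, y, method = "pearson")   # dominated by the outlying pair
```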
Construct a correlation table to look at relevant correlation statistics:
cor_cc %>%
round(2) %>%
as.data.frame() %>%
kbl() %>%
kable_material_dark()
Balance | Balance_frequency | Purchases | Oneoff_purchases | Installments_purchases | Cash_advance | Purchases_frequency | Oneoff_purchases_frequency | Purchases_installments_frequency | Cash_advance_frequency | Cash_advance_trx | Purchases_trx | Credit_limit | Payments | Minimum_payments | Prc_full_payment | Tenure | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Balance | 1.00 | 0.51 | -0.01 | 0.13 | -0.10 | 0.57 | -0.16 | 0.10 | -0.16 | 0.54 | 0.55 | -0.06 | 0.38 | 0.42 | 0.90 | -0.53 | 0.06 |
Balance_frequency | 0.51 | 1.00 | 0.13 | 0.12 | 0.12 | 0.13 | 0.20 | 0.14 | 0.15 | 0.17 | 0.17 | 0.19 | 0.10 | 0.16 | 0.50 | -0.22 | 0.23 |
Purchases | -0.01 | 0.13 | 1.00 | 0.75 | 0.71 | -0.39 | 0.79 | 0.69 | 0.61 | -0.40 | -0.39 | 0.89 | 0.26 | 0.40 | -0.01 | 0.23 | 0.13 |
Oneoff_purchases | 0.13 | 0.12 | 0.75 | 1.00 | 0.21 | -0.19 | 0.43 | 0.95 | 0.13 | -0.19 | -0.18 | 0.60 | 0.31 | 0.37 | 0.07 | 0.04 | 0.10 |
Installments_purchases | -0.10 | 0.12 | 0.71 | 0.21 | 1.00 | -0.36 | 0.79 | 0.19 | 0.92 | -0.37 | -0.36 | 0.78 | 0.13 | 0.23 | -0.05 | 0.27 | 0.12 |
Cash_advance | 0.57 | 0.13 | -0.39 | -0.19 | -0.36 | 1.00 | -0.45 | -0.19 | -0.38 | 0.94 | 0.95 | -0.41 | 0.16 | 0.27 | 0.48 | -0.28 | -0.12 |
Purchases_frequency | -0.16 | 0.20 | 0.79 | 0.43 | 0.79 | -0.45 | 1.00 | 0.47 | 0.85 | -0.46 | -0.45 | 0.92 | 0.11 | 0.16 | -0.10 | 0.29 | 0.09 |
Oneoff_purchases_frequency | 0.10 | 0.14 | 0.69 | 0.95 | 0.19 | -0.19 | 0.47 | 1.00 | 0.12 | -0.18 | -0.18 | 0.61 | 0.28 | 0.32 | 0.05 | 0.06 | 0.08 |
Purchases_installments_frequency | -0.16 | 0.15 | 0.61 | 0.13 | 0.92 | -0.38 | 0.85 | 0.12 | 1.00 | -0.38 | -0.38 | 0.78 | 0.05 | 0.11 | -0.09 | 0.26 | 0.11 |
Cash_advance_frequency | 0.54 | 0.17 | -0.40 | -0.19 | -0.37 | 0.94 | -0.46 | -0.18 | -0.38 | 1.00 | 0.98 | -0.41 | 0.09 | 0.20 | 0.46 | -0.30 | -0.13 |
Cash_advance_trx | 0.55 | 0.17 | -0.39 | -0.18 | -0.36 | 0.95 | -0.45 | -0.18 | -0.38 | 0.98 | 1.00 | -0.40 | 0.10 | 0.22 | 0.47 | -0.30 | -0.10 |
Purchases_trx | -0.06 | 0.19 | 0.89 | 0.60 | 0.78 | -0.41 | 0.92 | 0.61 | 0.78 | -0.41 | -0.40 | 1.00 | 0.19 | 0.28 | -0.03 | 0.25 | 0.16 |
Credit_limit | 0.38 | 0.10 | 0.26 | 0.31 | 0.13 | 0.16 | 0.11 | 0.28 | 0.05 | 0.09 | 0.10 | 0.19 | 1.00 | 0.47 | 0.26 | 0.02 | 0.17 |
Payments | 0.42 | 0.16 | 0.40 | 0.37 | 0.23 | 0.27 | 0.16 | 0.32 | 0.11 | 0.20 | 0.22 | 0.28 | 0.47 | 1.00 | 0.37 | 0.16 | 0.20 |
Minimum_payments | 0.90 | 0.50 | -0.01 | 0.07 | -0.05 | 0.48 | -0.10 | 0.05 | -0.09 | 0.46 | 0.47 | -0.03 | 0.26 | 0.37 | 1.00 | -0.48 | 0.14 |
Prc_full_payment | -0.53 | -0.22 | 0.23 | 0.04 | 0.27 | -0.28 | 0.29 | 0.06 | 0.26 | -0.30 | -0.30 | 0.25 | 0.02 | 0.16 | -0.48 | 1.00 | 0.01 |
Tenure | 0.06 | 0.23 | 0.13 | 0.10 | 0.12 | -0.12 | 0.09 | 0.08 | 0.11 | -0.13 | -0.10 | 0.16 | 0.17 | 0.20 | 0.14 | 0.01 | 1.00 |
The following variables are highly correlated (above the 0.8 cut-off), and I will remove these redundant variables from the dataset.
correlated_vars <- caret::findCorrelation(x = cor_cc,
names = T, # display detect variable names
cutoff = 0.80)
correlated_vars %>%
kbl(col.names = "Redundant Variables") %>%
kable_styling(full_width = FALSE)
Redundant Variables |
---|
Purchases_trx |
Purchases_frequency |
Cash_advance |
Cash_advance_frequency |
Installments_purchases |
Balance |
Oneoff_purchases |
Update the dataset by removing correlated variables.
cc <- cc %>%
select(-one_of(correlated_vars))
names(cc) %>%
kbl(col.names = "Active Variables") %>%
kable_styling(full_width = FALSE)
Active Variables |
---|
Balance_frequency |
Purchases |
Oneoff_purchases_frequency |
Purchases_installments_frequency |
Cash_advance_trx |
Credit_limit |
Payments |
Minimum_payments |
Prc_full_payment |
Tenure |
Now, the total number of variables from the dataset has been reduced from 17 to only 10. The 7 highly correlated variables have been removed.
In this section, I will use principal component analysis (PCA) to explore the data.
The first step (generally recommended) is to standardise the data so that variables measured in different units become comparable: each variable is centred to mean 0 and scaled to standard deviation 1.
cc.scale <- scale(cc)
Showing the first 6 rows of the data:
head(cc.scale) %>% kbl() %>% kable_material_dark()
Balance_frequency | Purchases | Oneoff_purchases_frequency | Purchases_installments_frequency | Cash_advance_trx | Credit_limit | Payments | Minimum_payments | Prc_full_payment | Tenure | |
---|---|---|---|---|---|---|---|---|---|---|
C10001 | -0.3700254 | -0.4291590 | -0.6862398 | -0.7171374 | -0.4794091 | -0.9625197 | -0.5439104 | -0.3054899 | -0.5376958 | 0.3551601 |
C10002 | 0.0676750 | -0.4731808 | -0.6862398 | -0.9264679 | 0.0992522 | 0.6771649 | 0.7968061 | 0.0876836 | 0.2123677 | 0.3551601 |
C10003 | 0.5053754 | -0.1164058 | 2.6464980 | -0.9264679 | -0.4794091 | 0.8138052 | -0.3994801 | -0.0999003 | -0.5376958 | 0.3551601 |
C10005 | 0.5053754 | -0.4657977 | -0.4085128 | -0.9264679 | -0.4794091 | -0.9078636 | -0.3801428 | -0.2611154 | -0.5376958 | 0.3551601 |
C10006 | 0.5053754 | 0.1420539 | -0.6862398 | 0.5388507 | -0.4794091 | -0.7438951 | -0.1321118 | 0.6503258 | -0.5376958 | 0.3551601 |
C10007 | 0.5053754 | 2.7989266 | 2.6464980 | 1.5855083 | -0.4794091 | 2.4534898 | 1.5704929 | -0.2808094 | 2.8375934 | 0.3551601 |
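As a sanity check on the standardisation, a minimal base-R example with hypothetical credit-limit values confirms that scale() centres a variable to mean 0 and scales it to standard deviation 1:

```r
# scale() standardises: mean 0, standard deviation 1
x <- c(50, 1600, 3000, 6500, 30000)  # hypothetical credit limits
z <- as.numeric(scale(x))
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```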
Applying the PCA algorithm: the scree plot shows that the first two components explain about 40.2% of the variation, and they will be used as the axes of the PCA factor map.
res.pca <- PCA(cc.scale, graph = F)
fviz_screeplot(res.pca, addlabels = T, ylim = c(0, 30), barfill = "orange", barcolor = "black") +
labs(title = "Scree Plot") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
The goal of PCA is to identify the directions in the dataset that explain the most variation; here the first two dimensions are examined.
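For intuition, the percentages shown on a scree plot come directly from the component variances. A minimal base-R sketch with hypothetical random data shows the computation:

```r
# Where scree-plot percentages come from: each component's variance
# as a share of the total variance (hypothetical random data)
set.seed(123)
M <- matrix(rnorm(300), ncol = 3)
p <- prcomp(M, scale. = TRUE)
prop <- p$sdev^2 / sum(p$sdev^2)  # proportion of variance per component
round(100 * cumsum(prop), 1)      # cumulative %, as read off a scree plot
```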
pca1 <- fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
pca2 <- fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)
pca3 <- fviz_pca_var(res.pca, repel = T,
col.var = "contrib",
gradient.cols = c("green", "green2", "darkgreen")) +
labs(title = "Factor Map: Variable - PCA") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
pg <- plot_grid(pca1, pca2)
plot_grid(pg, pca3, ncol = 1)
The important variables that explain much of the variation are Purchases, Cash_advance_trx, Payments, and Credit_limit.
These variables lie closest to the circumference of the correlation circle, indicating a high cos2 (cosine-squared) value: they are well represented by the first two principal components (Dim1 and Dim2). They also have high contribution levels compared to the other variables, meaning they contribute the most to the factor map.
PCA also reveals correlation trends among the variables. Positively correlated variables are grouped together, for example:
- “Purchases” and “Oneoff_purchases_frequency”
- “Credit_limit” and “Payments”
- “Cash_advance_trx” and “Minimum_payments”
By contrast, “Minimum_payments” and “Prc_full_payment” point in opposite directions, indicating a negative correlation.
During clustering, another branch of unsupervised learning, these variables may prove important in forming distinctive groups. The PCA detection of these important variables quickly generates some insights, such as:
- Would there be a group of credit card holders who love using their credit cards and have made numerous purchases, or vice versa?
- Would there be a group of credit card holders who prefer paying the minimum payment instead of the full balance they owe, or vice versa?
- Would there be a group of credit card holders who draw a lot of cash advances from the credit card company?
These questions will be investigated in the next section using several types of clustering algorithms.
The dataset must meet several conditions prior to clustering:
- Observations as rows and variables as columns
- No missing values in the dataset
- Standardised data (generally recommended) to make variables comparable; this transforms every variable to mean 0 and standard deviation 1
All these conditions have been met.
There are different types of clustering methods, such as partitioning clustering, which includes k-means, PAM, and CLARA, and hierarchical clustering, which includes the well-known AGNES and DIANA.
Partitioning and hierarchical methods are alternatives to each other. Although automated selection statistics are available to choose the best algorithm for a dataset, I will not run them here because the data is too large to process within a reasonable time. However, the proper clustering workflow will still be followed, along with several advanced clustering techniques.
A big issue with clustering techniques is that they always return clusters, even if the dataset contains none. Clustering tendency assessment is therefore an important step to test whether the dataset is clusterable at all.
The Hopkins statistic is a statistical test for the spatial randomness of the data. The null hypothesis is that the credit card dataset is uniformly distributed, indicating no meaningful clusters.
- Values close to 0 indicate regularly-spaced (uniformly distributed) data.
- Values around 0.5 indicate random data.
- Values between about 0.7 and 1 are significant evidence that the data may be clusterable.
set.seed(123)
hopkins(cc.scale, m = nrow(cc.scale)-1)
## [1] 0.9999995
The Hopkins statistic is close to 1; based on this value I can reject the null hypothesis and conclude that the credit card dataset is clusterable.
The following runs the same test from another R package and should yield a similar result. The Hopkins statistic is again close to 1, so I reach the same conclusion: the dataset is clusterable.
set.seed(123)
get_clust_tendency(cc.scale, n = 100, graph = F)
## $hopkins_stat
## [1] 0.9663011
##
## $plot
## NULL
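For intuition about what the statistic measures, here is a minimal base-R sketch of the Hopkins computation; the function name and the simulated data are illustrative only, and the real analysis relies on the packages above:

```r
# Base-R sketch of the Hopkins statistic (values near 1 suggest
# clusterable data, near 0.5 random data). Illustrative only.
hopkins_sketch <- function(X, m = 50) {
  n <- nrow(X); d <- ncol(X)
  rng <- apply(X, 2, range)
  # m uniform points over the data's bounding box
  U <- sapply(1:d, function(j) runif(m, rng[1, j], rng[2, j]))
  idx <- sample(n, m)
  nn_dist <- function(p, Y) min(sqrt(colSums((t(Y) - p)^2)))
  u <- sapply(1:m, function(i) nn_dist(U[i, ], X))   # uniform point -> data
  w <- sapply(1:m, function(i)                       # data point -> data
    nn_dist(X[idx[i], ], X[-idx[i], , drop = FALSE]))
  sum(u^d) / (sum(u^d) + sum(w^d))
}

set.seed(123)
clustered <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
                   matrix(rnorm(200, mean = 10), ncol = 2))
hopkins_sketch(clustered)  # close to 1 for well-separated clusters
```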
I have picked CLARA as the clustering algorithm for this dataset because it has more than a thousand observations (Alboukadel Kassambara 2017). CLARA stands for “Clustering Large Applications” and is an extension of PAM (Partitioning Around Medoids); PAM is a robust alternative to k-means clustering because it is less sensitive to outliers.
As its name suggests, CLARA is designed for large datasets: its sampling approach reduces both computation time and memory usage. For the same reason I did not pick hierarchical clustering (AGNES and DIANA) for this project.
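The sampling idea behind CLARA can be sketched in base R. Note that kmeans stands in for PAM here purely to keep the sketch dependency-free, so this illustrates only the sample-then-assign step, not CLARA itself:

```r
# Conceptual sketch of CLARA's sampling idea (kmeans replaces PAM here,
# so this is an illustration, not the actual CLARA algorithm)
set.seed(123)
X <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
           matrix(rnorm(2000, mean = 6), ncol = 2))
s  <- X[sample(nrow(X), 200), ]   # cluster only a small sample...
km <- kmeans(s, centers = 2)
# ...then assign every observation to its nearest centre
d  <- as.matrix(dist(rbind(km$centers, X)))[-(1:2), 1:2]
cl <- max.col(-d)
table(cl)                         # two groups of roughly 1000 each
```

Because only the sample is clustered, the expensive step runs on 200 rows instead of 2000; CLARA repeats this over several samples and keeps the best medoids.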
The following runs the CLARA algorithm for six different values of K, for exploration purposes.
res.clara2 <- eclust(cc.scale,
FUNcluster = "clara",
k = 2,
graph = F)
res.clara3 <- eclust(cc.scale,
FUNcluster = "clara",
k = 3,
graph = F)
res.clara4 <- eclust(cc.scale,
FUNcluster = "clara",
k = 4,
graph = F)
res.clara5 <- eclust(cc.scale,
FUNcluster = "clara",
k = 5,
graph = F)
res.clara6 <- eclust(cc.scale,
FUNcluster = "clara",
k = 6,
graph = F)
res.clara7 <- eclust(cc.scale,
FUNcluster = "clara",
k = 7,
graph = F)
f1 <- fviz_cluster(res.clara2, show.clust.cent = F, geom = "point", alpha = 0.5, palette = "jco") +
theme_classic() + labs(title = "Cluster plot (K = 2)")
f2 <- fviz_cluster(res.clara3, show.clust.cent = F, geom = "point", alpha = 0.5, palette = "jco") +
theme_classic() + labs(title = "Cluster plot (K = 3)")
f3 <- fviz_cluster(res.clara4, show.clust.cent = F, geom = "point", alpha = 0.5, palette = "jco") +
theme_classic() + labs(title = "Cluster plot (K = 4)")
f4 <- fviz_cluster(res.clara5, show.clust.cent = F, geom = "point", alpha = 0.5, palette = "jco") +
theme_classic() + labs(title = "Cluster plot (K = 5)")
f5 <- fviz_cluster(res.clara6, show.clust.cent = F, geom = "point", alpha = 0.5, palette = "jco") +
theme_classic() + labs(title = "Cluster plot (K = 6)")
f6 <- fviz_cluster(res.clara7, show.clust.cent = F, geom = "point", alpha = 0.5, palette = "jco") +
theme_classic() + labs(title = "Cluster plot (K = 7)")
plot_grid(f1, f2, f3, f4, f5, f6,
ncol = 3,
nrow = 2)
From the visualisation above, I can see how the partitions change (or are added) as K increases. The following shows the silhouette width of each observation for each k.
The line in the middle of each graph is the overall average silhouette width for that k; the higher the silhouette width, the better the quality of the clustering or observation.
In following plots, it can be interpreted this way (Alboukadel Kassambara 2017):
- Observations with a large silhouette value are very well clustered.
- Observations with a small silhouette value lie between two clusters.
- Observations with a negative silhouette value are probably assigned to the wrong cluster. However, negative values may be unavoidable.
f7 <- fviz_silhouette(res.clara2, palette = "jco")
## cluster size ave.sil.width
## 1 1 6897 0.23
## 2 2 1739 0.14
f8 <- fviz_silhouette(res.clara3, palette = "jco")
## cluster size ave.sil.width
## 1 1 3978 0.31
## 2 2 3462 0.04
## 3 3 1196 0.25
f9 <- fviz_silhouette(res.clara4, palette = "jco")
## cluster size ave.sil.width
## 1 1 4267 0.30
## 2 2 2574 0.07
## 3 3 1065 0.27
## 4 4 730 0.23
f10 <- fviz_silhouette(res.clara5, palette = "jco")
## cluster size ave.sil.width
## 1 1 3699 0.29
## 2 2 1737 0.06
## 3 3 1868 0.22
## 4 4 1176 0.22
## 5 5 156 -0.13
f11 <- fviz_silhouette(res.clara6, palette = "jco")
## cluster size ave.sil.width
## 1 1 2054 0.45
## 2 2 1842 -0.09
## 3 3 1118 0.10
## 4 4 1881 0.18
## 5 5 1660 0.04
## 6 6 81 -0.09
f12 <- fviz_silhouette(res.clara7, palette = "jco")
## cluster size ave.sil.width
## 1 1 2896 0.37
## 2 2 1885 0.18
## 3 3 638 0.02
## 4 4 1254 0.17
## 5 5 1141 -0.10
## 6 6 95 0.03
## 7 7 727 0.22
plot_grid(f7, f8, f9, f10, f11, f12,
ncol = 3,
nrow = 2)
The graphs above can be used as a reference when interpreting the result of a suggested optimal K.
Instead of choosing the clustering manually, there are methods to suggest the best K. I will now use two typical methods to find the optimal number of clusters (specified by “k”). All methods should recommend the same or a highly similar optimal k. In general, the optimal k is specified by the analyst of the project, and it can be a hard task. There is also no definite answer to which k is best; it is subjective and relies on the chosen distance matrix and the parameters used for clustering the data (Alboukadel Kassambara 2017).
Elbow Method:
The following shows the result of the Elbow method. In the Elbow method, the clustering algorithm is computed for different numbers of clusters (k). For each k, the total within sum of squares (WSS) is calculated. Total WSS can be understood as the total within-cluster variation, which we want to minimise.
Each additional unit of K decreases the WSS; however, the best K is located where the rate of reduction in total WSS starts to slow down significantly. This is known as the bend (knee), and its location is generally considered an indicator of the appropriate k.
Therefore, in the following total WSS graph, the optimal K can be either 5 or 6.
fviz_nbclust(cc.scale, FUNcluster = clara, method = "wss")
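The WSS curve behind this plot can also be computed by hand. This is a minimal sketch using kmeans() on toy data as a stand-in for cc.scale (clara() does not report a WSS directly, so kmeans() is used for illustration only):

```r
set.seed(123)
toy <- scale(iris[, 1:4])  # toy stand-in for cc.scale

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Look for the "knee" where the decrease in WSS slows down sharply
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within sum of squares")
```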
Silhouette Method
This method uses the average silhouette width to determine the optimal k. The average silhouette width is the y-axis in the following graph, and it can be understood as an averaged “similarity” value of the points within a cluster.
In each cluster, we want the observations to be similar to one another; that way they will have a high silhouette width. The average silhouette width combines the silhouette values of all points in a cluster, so a high average silhouette width also means good clustering.
The silhouette method in this section suggests 4 as the optimal K for clustering because this k has the highest average silhouette width.
fviz_nbclust(cc.scale, FUNcluster = clara, method = "silhouette")
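The silhouette value underlying this plot is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other points of its own cluster and b(i) is the mean distance to the nearest other cluster. A toy illustration with cluster::silhouette() (the data here are invented for the example):

```r
library(cluster)

set.seed(123)
# Two well-separated toy groups, so the average silhouette width should be high
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

grp <- kmeans(toy, centers = 2)$cluster
sil <- silhouette(grp, dist(toy))  # per-point silhouette values
summary(sil)$avg.width             # overall average silhouette width
```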
Analysing the result
Therefore, what is the best K?
- The Elbow method suggests 5 or 6, and
- The average silhouette width method suggests 4 or 5 (the average silhouette width at 5 is actually not far behind 4).
Therefore, I will take 5 as the optimal number of clusters for CLARA because it is the value suggested by both methods.
The following code chunk adds the clustering results to the scaled original data.
# set up dataframe
clara_k6 <- cbind(as.data.frame(cc.scale),
cluster = res.clara5$clustering) %>% # merge scaled data with clusters
mutate(cluster = as.factor(cluster)) %>%
relocate(cluster, .before = Balance_frequency)
# plot
ggRadar(clara_k6, aes(group = cluster),
size = 1,
use.label = F,
rescale = F) +
# Use scaled data to plot Radar, do not normalise btw 1-0 by rescale = T.
# scaled data better show subtle trends
theme_bw() +
labs(title = "Radar plot") +
facet_wrap(~cluster) +
theme(legend.position = "none",
axis.text.x = element_text(size = 7),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
The following shows the median of each variable in each cluster. I plot them to aid interpretation of the Radar plot above, if required.
set.seed(123)
clara_k6_median <- clara_k6 %>%
as.data.frame() %>%
group_by(cluster) %>%
summarise_all(median)
ggparcoord(clara_k6_median,
column = c(2:11),
groupColumn = "cluster",
scale = "globalminmax",
showPoints = T) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "top") +
labs(title = "Parallel Coordinate Plot",
subtitle = "Showing: Median of each Variable") +
geom_text_repel(aes(label = cluster), box.padding = 0.1)
Five groups of credit card users have been identified:
Cluster 1: Infrequent Users. The credit card users in this cluster do not use their credit cards much; they use them occasionally but are less active than users in other clusters.
Cluster 2: One-off Purchasers. Credit card users in this group prefer to use the credit card for one-off purchases and are less inclined to make installment purchases.
Cluster 3: Installment Purchasers. Credit card users in this group love to use the credit card for installment purchases.
Cluster 4: Max Payers. Users in this group have a zero level of “Balance_frequency”, a variable tracking their activity with the credit card company. They also have a high percentage of full payment; they prefer to pay the money they owe in full, which may explain their low level of “Balance_frequency”.
Cluster 5: Active Revolvers. Users in this group are characterised by very high levels of credit card activity on the dollar scale. The variables “Purchases” and “Payments” are in dollar units. Users in this group spend larger amounts on their purchases, have the highest level of cash-in-advance transactions, make repayments in large amounts, and have a low full-payment percentage. It seems they do not like to pay back in full.
In this section, I perform hkmeans, a method that combines K-means clustering with hierarchical clustering. It is a technique that can improve K-means results.
The biggest limitation of K-means clustering is that the final result depends on the initial random selection of centroids. In hkmeans, however, the initial centroids are not chosen at random but taken from the result of hierarchical clustering, and can be understood as optimised centroids. Therefore, in hierarchical K-means clustering, the hierarchical clustering is performed first, followed by the K-means clustering.
The hierarchical K-means clustering algorithm:
- Step 1: Compute hierarchical clustering and cut the tree at the chosen K (determined below via the WSS and silhouette methods). In this step, a distance matrix between each pair of observations is computed, and based on it a linkage function groups similar objects into a hierarchical cluster tree. Each observation is initially considered a cluster of its own; clusters are then merged upward into bigger clusters until the root is reached. Finally, the tree is cut into the pre-specified K.
- Step 2: Compute the center (i.e., mean) of each cluster, giving one new center per cluster.
- Step 3: Compute K-means clustering on the original standardised credit card dataset using the cluster centers from step 2 as the initial cluster centers. According to the K-means algorithm, each observation is then assigned to the closest centroid based on a distance metric (e.g., Euclidean). For each of the k clusters, a new mean is computed from all data points in the cluster, and all observations are re-allocated to these new means. The allocation of observations and the calculation of new means are iterated until the total within sum of squares (WSS) is minimised.
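The steps above can be sketched in base R. This is a hedged illustration on toy data (iris stands in for cc.scale, and the distance/linkage choices mirror the hkmeans() call used later, which wraps this logic):

```r
x <- scale(iris[, 1:4])  # toy stand-in for cc.scale
k <- 2

# Step 1: hierarchical clustering, then cut the tree into k groups
hc  <- hclust(dist(x, method = "manhattan"), method = "average")
grp <- cutree(hc, k = k)

# Step 2: the mean of each hierarchical cluster becomes an initial center
centers <- aggregate(x, by = list(cluster = grp), FUN = mean)[, -1]

# Step 3: k-means seeded with those centers instead of random ones
km <- kmeans(x, centers = as.matrix(centers))
table(km$cluster)
```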
Optimal K Search
The following silhouette method suggests 2 as the optimal K.
fviz_nbclust(cc.scale,
FUNcluster = kmeans,
method = "silhouette")
The following K-means WSS Elbow method suggests the same: the rate of WSS reduction slows after the second K.
fviz_nbclust(cc.scale, FUNcluster = kmeans, method = "wss")
From the WSS and silhouette analyses above, I will use 2 as the optimal K for clustering.
The following code completes the hkmeans algorithm.
# Create hierarchical k-means clustering
res.hk <- hkmeans(cc.scale,
k = 2,
hc.metric = "manhattan",
hc.method = "average", # Single or Average linkage function is recommended as they are less sensitive to the skewness caused by outliers in this credit card dataset.
km.algorithm = "Hartigan-Wong")
The “res.hk” object contains the following results.
# Items created from the object
names(res.hk)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
These 2 clusters have the following sizes:
res.hk$size
## [1] 6926 1710
Graphical presentation of the 2 clusters:
hk.cluster <- fviz_cluster(res.hk,
palette = "jco",
geom = "point",
alpha = 0.5)
hk.cluster +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5)) +
labs(title = "Hierarchical K-means Clustering Results",
subtitle = "Hartigan-Wong algorithm + Manhattan + average linkage + K = 2")
Plotting a Radar plot to see how distinctive each of the 2 clusters is.
hk_df <- cbind(as.data.frame(cc.scale),
cluster = res.hk$cluster) %>%
mutate(cluster = as.factor(cluster)) %>%
relocate(cluster, .before = Balance_frequency)
ggRadar(hk_df, aes(group = cluster),
size = 1,
rescale = F,
use.label = F) +
theme_bw() +
labs(title = "Radar plot") +
facet_wrap(~cluster)
The two clusters identified from the entire dataset have the following sample sizes:
table(res.hk$cluster)
##
## 1 2
## 6926 1710
set.seed(123)
hk_df_median <- hk_df %>%
as.data.frame() %>%
group_by(cluster) %>%
summarise_all(median)
ggparcoord(hk_df_median,
column = c(2:11),
groupColumn = "cluster",
scale = "globalminmax",
showPoints = T) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "top") +
labs(title = "Parallel Coordinate Plot",
subtitle = "Showing: Median of each Variable") +
geom_text_repel(aes(label = cluster), box.padding = 0.1)
Insights:
Cluster 1: Less-active users. Most of the credit card users in this cluster use their credit card very infrequently.
Cluster 2: Active users. These credit card users actively use their credit card to make purchases; one-off purchases are more popular than installment purchases.
This section performs an alternative to K-means clustering. In K-means clustering or PAM, each observation is assigned to exactly one cluster. In fuzzy clustering, however, each observation has a probability of belonging to each cluster; points close to the center of a cluster have a higher probability than points further away. It is a technique often used in marketing to determine customers’ preferences and shopping patterns in a more realistic way.
Computing the fuzzy algorithm:
res.fanny <- fanny(cc.scale, k = 2,
metric = "euclidean",
stand = FALSE,
memb.exp = 1.05)
Showing the membership coefficients:
head(res.fanny$membership, 10)
## [,1] [,2]
## C10001 0.999999914 8.648864e-08
## C10002 0.999820051 1.799489e-04
## C10003 0.093730995 9.062690e-01
## C10005 0.999999984 1.638830e-08
## C10006 0.761181762 2.388182e-01
## C10007 0.000172510 9.998275e-01
## C10008 0.001644969 9.983550e-01
## C10009 0.999634366 3.656337e-04
## C10010 0.971929245 2.807076e-02
## C10011 0.002505073 9.974949e-01
The following code shows the cluster each observation belongs to (extracting the first 10 observations).
head(res.fanny$clustering, 10)
## C10001 C10002 C10003 C10005 C10006 C10007 C10008 C10009 C10010 C10011
## 1 1 2 1 1 2 2 1 1 2
Visualising the clusters:
fviz_cluster(res.fanny, geom = "point", repel = T,
palette = "jco")
The silhouette width (Si) is a type of internal cluster validation, ranging from -1 to 1. If an observation has a Si close to 1, the point is well clustered; if it is close to -1, the point is poorly clustered and assigning it to another cluster might improve the results. A point with a negative Si can be understood as not being in the correct cluster (Alboukadel Kassambara 2017).
fviz_silhouette(res.fanny, palette = "jco")
## cluster size ave.sil.width
## 1 1 4678 0.36
## 2 2 3958 -0.03
According to the silhouette plot above, which is available for the Fuzzy algorithm, there is a lot of noise in cluster 2, and overall both clusters have a relatively low silhouette width, a statistic used to measure clustering quality. Therefore, the results in the following graphs should be interpreted with caution.
# Radar plot
fuz_df <- cbind(as.data.frame(cc.scale),
cluster = res.fanny$clustering) %>% # merge scaled data with clusters
mutate(cluster = as.factor(cluster)) %>%
relocate(cluster, .before = Balance_frequency)
ggRadar(fuz_df, aes(group = cluster),
size = 1,
use.label = T,
rescale = F) +
# Use scaled data to plot Radar, do not normalise btw 1-0 by rescale = T.
# scaled data better show subtle trends
theme_bw() +
labs(title = "Radar plot") +
facet_wrap(~ cluster)
set.seed(123)
fuz_df_median <- fuz_df %>%
as.data.frame() %>%
group_by(cluster) %>%
summarise_all(median)
ggparcoord(fuz_df_median,
column = c(2:11),
groupColumn = "cluster",
scale = "globalminmax",
showPoints = T) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "top") +
labs(title = "Parallel Coordinate Plot",
subtitle = "Showing: Median of each Variable") +
geom_text_repel(aes(label = cluster), box.padding = 0.1)
Fuzzy clustering with 2 clusters has similar characteristics to the hierarchical K-means clustering: it also groups users into a less-active group and an active group. Some differences are:
Cluster 1: The preference for cash in advance is more obvious than in cluster 1 of HKmeans in the previous section.
Cluster 2: Fuzzy clustering indicates that installment purchases are more popular than one-off purchases, though both fall under the same “purchases” category. The important similarity is preserved: users in this group also prefer to make full repayment and are less inclined to make minimum payments.
Model-based clustering also computes probabilities for each observation. In addition, model-based clustering will suggest the best k for us.
The algorithm used in model-based clustering is the Expectation-Maximisation (EM) algorithm. Here, the data are considered to come from a mixture of two or more clusters, each with its own density.
The algorithm starts with hierarchical model-based clustering and cuts the dendrogram at different numbers of k. For each k, each cluster is centered at its mean, with density increasing for points near the mean.
The geometric features (volume, shape, and orientation) of each cluster are then determined, and the similarity of volume, shape, and orientation between clusters is described using a standard parameterisation.
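The BIC-driven model search can also be inspected directly with mclust. This is a hedged sketch on toy data (standing in for cc.scale); the Mclust() call below performs this search internally:

```r
library(mclust)

set.seed(123)
toy <- scale(iris[, 1:4])  # toy stand-in for cc.scale

# BIC for each covariance model name x number of mixture components
bic <- mclustBIC(toy)
summary(bic)   # top-ranked model/component combinations by BIC
plot(bic)      # visual comparison of the candidate models
```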
Fitting the model-based clustering:
res.mc <- Mclust(cc.scale)
The best model is determined using the Bayesian Information Criterion (BIC). A large BIC indicates a good model, and the model with the highest BIC is selected.
The model-based clustering selected a model with 8 clusters (k); the selected model is named VEV, meaning the clusters have varying volume, equal shape, and varying orientation on the coordinate axes.
summary(res.mc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 7 components:
##
## log-likelihood n df BIC ICL
## -23317.69 8636 407 -50324.3 -50792.31
##
## Clustering table:
## 1 2 3 4 5 6 7
## 2267 1312 385 1071 357 1356 1888
Visualising the outcome of the model-based clustering:
mc1 <- fviz_mclust(res.mc,
what = "classification",
geom = "point",
alpha = 0.1)
mc2 <- fviz_mclust(res.mc, what = "BIC")
mc3 <- fviz_mclust(res.mc, what = "uncertainty") + labs(subtitle = "Larger symbols indicate the more uncertain observations.")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
pg1 <- plot_grid(mc1, mc2)
plot_grid(pg1, mc3,
nrow = 2)
Visualising the clustering result using Radar plot:
# df
mc_df <- cbind(as.data.frame(cc.scale),
cluster = res.mc$classification) %>% # merge scaled data with clusters
mutate(cluster = as.factor(cluster)) %>%
relocate(cluster, .before = Balance_frequency)
# Radar plot
ggRadar(mc_df, aes(group = cluster),
size = 1,
use.label = T,
rescale = F) +
# Use scaled data to plot Radar, do not normalise btw 1-0 by rescale = T.
# scaled data better show subtle trends
theme_bw() +
labs(title = "Radar plot") +
facet_wrap(~cluster) +
theme(plot.margin = unit(c(0.1, 0.1, 0.1, 0.1), "mm"))
Alternatively, a parallel coordinate plot may help:
set.seed(111)
mc_df_median <- mc_df %>%
as.data.frame() %>%
group_by(cluster) %>%
summarise_all(median)
ggparcoord(mc_df_median,
column = c(2:11),
groupColumn = "cluster",
scale = "globalminmax",
showPoints = T) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = "top") +
labs(title = "Parallel Coordinate Plot",
subtitle = "Showing: Median of each Variable") +
geom_text_repel(aes(label = cluster), box.padding = 0.1)
Model-based clustering identified 8 groups with the following characterisation, based on the median of each cluster:
Cluster 1: Less active users. They do not use the credit card much, but slightly prefer cash-in-advance.
Cluster 2: Less active users. They do not use the credit card much, but slightly prefer to use it to make purchases, especially installment purchases.
Cluster 3: Revolvers. They prefer to make minimum payments back to the credit card company; they make expensive purchases, and therefore have a higher credit limit and make bigger payments back, but they do not like to pay back in full.
Cluster 4: Less active users, but they prefer to make full payment, and they have very low tenure.
Cluster 5: Active card users. They buy expensive products and make big payments back to the credit card company; sometimes they prefer full payment, sometimes they pay the minimum.
Cluster 6: Max payers, but they are less active, with a zero level of tenure.
Cluster 7: Revolvers. They are less active revolvers who prefer to make minimum payments; they spend small amounts on cheaper products.
Cluster 8: Max payers, but they are active, with a high level of tenure.
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a method that can easily detect and extract outliers and cluster the observations around their core members. To compute the DBSCAN algorithm, we need to specify epsilon (eps) and the minimum number of points (MinPts).
- For the best “MinPts”, the rule of thumb is that it should be large if the dataset is large, but it must be at least 3. I will work with different values of MinPts to see their effects on the cluster configurations.
- For the best “eps”, different values can also be tried out; however, I will go with the KNN method. This method computes the k-nearest-neighbour distances in a matrix of points (Alboukadel Kassambara 2017). The k specified here is not the number of clusters but the number of nearest neighbours, corresponding to MinPts. I will plot the relevant graph and look for the “knee” (where a sharp change is about to happen), which marks the optimal value of epsilon (eps).
In my DBSCAN runs, I will try different MinPts values from 3 to 8. Therefore, I computed a KNN distance plot below for each MinPts to search for its best epsilon (eps).
set.seed(123)
par(mfrow = c(2,3))
#
dbscan::kNNdistplot(cc.scale, k = 3) # K = MinPt, K here is KNN'K
abline(h = 1.7, lty = 2)
text(x = 2000, y = 2.1, "1.7", col = "blue")
mtext("1.7 is the optimal eps for 3 MinPt ")
dbscan::kNNdistplot(cc.scale, k = 4) # K = MinPt, K here is KNN'K
abline(h = 1.8, lty = 2)
text(x = 2000, y = 2.3, "1.8", col = "blue")
mtext("1.8 is the optimal eps for 4 MinPt ")
dbscan::kNNdistplot(cc.scale, k = 5) # K = MinPt, K here is KNN'K
abline(h = 2, lty = 2)
text(x = 2000, y = 2.5, "2", col = "blue")
mtext("2 is the optimal eps for 5 MinPt ")
#
dbscan::kNNdistplot(cc.scale, k = 6) # K = MinPt, K here is KNN'K
abline(h = 2, lty = 2)
text(x = 2000, y = 2.5, "2", col = "blue")
mtext("2 is the optimal eps for 6 MinPt ")
dbscan::kNNdistplot(cc.scale, k = 7) # K = MinPt, K here is KNN'K
abline(h = 2.2, lty = 2)
text(x = 2000, y = 2.8, "2.2", col = "blue")
mtext("2.2 is the optimal eps for 7 MinPt ")
dbscan::kNNdistplot(cc.scale, k = 8) # K = MinPt, K here is KNN'K
abline(h = 2.2, lty = 2)
text(x = 2000, y = 2.8, "2.2", col = "blue")
mtext("2.2 is the optimal eps for 8 MinPt ")
Applying the above results, I now compute the DBSCAN algorithm for 6 different MinPts values.
set.seed(123)
dbsscan.res.eps1 <- fpc::dbscan(cc.scale,
eps = 1.7,
MinPts = 3)
dbsscan.res.eps2 <- fpc::dbscan(cc.scale,
eps = 1.8,
MinPts = 4)
dbsscan.res.eps3 <- fpc::dbscan(cc.scale,
eps = 2,
MinPts = 5)
dbsscan.res.eps4 <- fpc::dbscan(cc.scale,
eps = 2,
MinPts = 6)
dbsscan.res.eps5 <- fpc::dbscan(cc.scale,
eps = 2.2,
MinPts = 7)
dbsscan.res.eps6 <- fpc::dbscan(cc.scale,
eps = 2.2,
MinPts = 8)
g1 <- fviz_cluster(dbsscan.res.eps1,
data = cc.scale,
stand = F) + labs(title = "MinPts = 3 with eps at 1.7")
g2 <- fviz_cluster(dbsscan.res.eps2,
data = cc.scale,
stand = F) + labs(title = "MinPts = 4 with eps at 1.8")
g3 <- fviz_cluster(dbsscan.res.eps3,
data = cc.scale,
stand = F) + labs(title = "MinPts = 5 with eps at 2")
g4 <- fviz_cluster(dbsscan.res.eps4,
data = cc.scale,
stand = F) + labs(title = "MinPts = 6 with eps at 2")
g5 <- fviz_cluster(dbsscan.res.eps5,
data = cc.scale,
stand = F) + labs(title = "MinPts = 7 with eps at 2.2")
g6 <- fviz_cluster(dbsscan.res.eps6,
data = cc.scale,
stand = F) + labs(title = "MinPts = 8 with eps at 2.2")
plot_grid(g1, g2, g3, g4, g5, g6,
nrow = 3,
ncol = 2)
From the results above, the DBSCAN algorithm may not be suitable for this dataset. DBSCAN works well when the distinct clusters in a dataset have similar point densities; here, however, all data points are close together and form one big, dense area, which may have defeated the DBSCAN algorithm.
Comparing the results from CLARA, H-Kmeans, Fuzzy, model-based, and DBSCAN clustering: the VEV model from model-based clustering detects 8 clusters, the largest number of distinct clusters among all the algorithms.
Cluster 1: Less active users who prefer cash-in-advance.
Cluster 2: Less active users who prefer to use the credit card to make purchases, especially installment purchases.
Cluster 3: Revolvers who prefer to make expensive purchases.
Cluster 4: Less active users who prefer to make full repayment back to the credit card company.
Cluster 5: Active card users who make expensive purchases and large repayments; sometimes they prefer full payment, sometimes they pay the minimum.
Cluster 6: Max payers who prefer to pay the money owed in full, with zero tenure.
Cluster 7: Less active revolvers who spend small amounts to purchase cheaper products.
Cluster 8: Max payers, but this user group is more active in making purchases than cluster 6.
If a smaller number of clusters is of interest, the next best algorithm is CLARA, which suggests 5 groups of credit card users, followed by the 2 groups suggested by HKmeans and Fuzzy clustering.
Thank you for reading.
Arjun Bhasin 2018, Credit card Dataset for Clustering, viewed 10 May 2022, https://www.kaggle.com/datasets/arjunbhasin2013/ccdata
Alboukadel Kassambara 2017, Practical Guide to Cluster Analysis in R, Multivariate Analysis 1, Edition 1, sthda.com
Clustering and dimensionality reduction techniques on the Berlin Airbnb data and the problem of mixed data (n.d.), viewed 15 May 2022, https://rstudio-pubs-static.s3.amazonaws.com/579984_6b9efbf84ee24f00985c29e24265d2ba.html
Jenny Listman 2019, Customer Segmentation / K Means Clustering, viewed 14 May 2022, https://www.kaggle.com/code/jennylistman/customer-segmentation-k-means-clustering
Matt.0 2019, 10 Tips for Choosing the Optimal Number of Clusters, viewed 12 May 2022, https://towardsdatascience.com/10-tips-for-choosing-the-optimal-number-of-clusters-277e93d72d92