Predicting Credit Card Defaults with Machine Learning

Marcos Dominguez · Published in The Startup · 6 min read · Feb 26, 2021

According to Federal Reserve Economic Data, credit card delinquency rates have been increasing since 2016 (the sharp decrease in Q1 2020 is due to COVID relief measures).

Banks charge off delinquent credit card accounts and eat the losses. If only there were a way to predict which customers have the highest probability of defaulting so it could be prevented…

Problem

Can we reliably predict who is likely to default? If so, the bank may be able to prevent the loss by offering the customer alternatives (such as forbearance or debt consolidation). I will use various machine learning classification techniques to perform my analysis.

Data

Source: the classic "Default of Credit Card Clients" dataset from the UC Irvine Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Target: Did the customer default? (Yes=1/Positive, No=0/Negative)

Features:

  1. Credit Limit: Amount of the given credit (in NT dollars); it includes both the individual consumer's credit and his/her family (supplementary) credit
  2. Sex (1=male; 2=female)
  3. Education (1=graduate school; 2=university; 3=high school; 4=other)
  4. Marital Status (1=married; 2=single; 3=others)
  5. Age (years)
  6. History of past payment: The measurement scale for the repayment status is: -1 = paid duly; 1 = payment delay for one month; 2 = payment delay for two months; …; 8 = payment delay for eight months; 9 = payment delay for nine months and above
  7. Amount of bill statement (NT dollars) for each of the past 6 months
  8. Amount of previous payment (NT dollars) for each of the past 6 months
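To follow along, here is a minimal sketch of loading the data with pandas. An assumption on my part: the raw .xls file from the UCI page, whose real column headers sit on the second row.

```python
import pandas as pd

# Load the UCI "default of credit card clients" Excel file.
# The file's real column headers are on the second row, hence header=1.
df = pd.read_excel("default of credit card clients.xls", header=1)

# The raw target column is named "default payment next month";
# rename it for convenience (illustrative choice).
df = df.rename(columns={"default payment next month": "DEFAULT"})

print(df.shape)  # expect (30000, 25)
```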

Methodology

  1. Exploratory Data Analysis
  2. Baseline Model
  3. Performance Metrics
  4. Optimization
  5. Feature Importance
  6. Hyperparameter Tuning
  7. Class Imbalance
  8. Analyze Results

Exploratory Data Analysis

The following are some key findings in the data. Shout-out to Gabriel Preda from Kaggle for the awesome visualization ideas. Check out his work here:

https://www.kaggle.com/gpreda

The distribution of target classes is highly imbalanced: non-defaults far outnumber defaults. This is common in such datasets, since most people pay their credit cards on time (assuming there isn't an economic crisis).

0=No Default, 1=Default
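Checking the imbalance numerically is one line (a sketch, reusing the df and DEFAULT column from the loading snippet above):

```python
# Share of each target class; non-defaults dominate (~78% vs. ~22%).
print(df["DEFAULT"].value_counts(normalize=True))
```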

Payment status. Correlation strength increases the closer the months are in time, which makes sense: one could assume a late payment in August would likely lead to a late payment in September. However, it is less clear we can make the same assumption for April and September.
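This pattern shows up in a correlation matrix over the repayment-status columns (named PAY_0 and PAY_2 through PAY_6 in the raw UCI file; there is no PAY_1):

```python
# Repayment-status columns; adjacent months correlate more strongly
# than months that are far apart in time.
pay_cols = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
print(df[pay_cols].corr().round(2))
```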

Distribution of credit limit amounts. The three largest credit limit amount groups are $50k, $20k, and $30k, respectively.

Credit Limit by Sex. The data is evenly distributed amongst males and females.

Marriage, age, and sex. The dataset mostly contains couples in their mid-30s to mid-40s and single people in their mid-20s to early-30s.

Baseline Model

Now comes the fun part. Let’s start the prediction process by establishing a baseline model that we can build on.

Prepare features and target:
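Something like the following (a sketch; the exact feature set I used is in the repo):

```python
# Features: every column except the row ID and the target.
X = df.drop(columns=["ID", "DEFAULT"])
y = df["DEFAULT"]
```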

Scale the data so the models can digest the information more easily:
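For example, with scikit-learn's StandardScaler, fitting on the training split only so no test information leaks in (split parameters are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out a test set, then standardize using training statistics only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```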

Score several models and choose one to improve upon:
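A sketch of the comparison loop; the model list and settings here are illustrative, not necessarily the exact ones I ran:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

# Mean cross-validated accuracy on the scaled training data.
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    print(f"{name}: {scores.mean():.4f}")
```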

Support Vector Machine (SVM) performs the best, with an accuracy of 0.8193. However, we will move forward with Random Forest because it performs nearly as well and is less computationally expensive.

Optimization

Before moving on to performance metrics, let's discuss optimization. What metric exactly are we optimizing? In this case, we are optimizing recall. Recall is a performance metric that attempts to answer the question: what proportion of actual positives was identified correctly? Mathematically, the formula is:

Recall = TP / (TP + FN)

TP = True Positive, or a correctly predicted default

FN = False Negative, or an incorrectly predicted non-default

Ideally, we do not want to allow any defaults to fall through the cracks, so our optimal model will minimize false negatives (so that the recall score is as high as possible).

For a complete discussion of Precision vs Recall scores, check out this article created by the awesome developers at Google:

https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

Performance Metrics

Confusion Matrix

The number we want to minimize is in the bottom-left quadrant: the customers we predicted would NOT default but who in fact DID default. We need this number to be as small as possible.
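A sketch of producing the matrix and the recall score with scikit-learn (assuming the Random Forest and the train/test split from the baseline section):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train)
y_pred = rf.predict(X_test_scaled)

# Rows are actual classes, columns are predicted classes,
# so false negatives sit at position [1, 0] (bottom-left).
print(confusion_matrix(y_test, y_pred))
print("Recall:", round(recall_score(y_test, y_pred), 2))
```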

A recall of 0.95 is very good, so we're off to a great start. Let's see what improvements we can make.

Feature Selection

There are many feature selection techniques one could use to determine which features are most useful. For this case, we will use Feature Importance.

Generally speaking, Feature Importance is the process of assigning a score to each feature, ranking its usefulness in predicting the target variable.
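With a fitted Random Forest, the scores come straight from the model; a sketch:

```python
import pandas as pd

# Impurity-based importances from the fitted Random Forest,
# aligned with the original column order.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```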

The top features are shown in the chart above. It's interesting that 'age' is the second most important feature. Let's keep all of them except the last 6, the categorical variables. Removing the least important features and keeping the most important ones might allow us to better predict our target variable.

Hyperparameter Tuning

Let's search for optimal parameters for our model:
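One common approach is a grid search scored on recall; a sketch with an illustrative parameter grid (not necessarily the one I searched):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

# Optimize for recall, since false negatives are the costly error here.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_train_scaled, y_train)
print(search.best_params_, round(search.best_score_, 2))
```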

Class Imbalance

In the Exploratory Data Analysis section, we established that the target is very imbalanced. “Imbalanced classifications pose a challenge for predictive modeling as most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class” (Jason Brownlee, Machine Learning Mastery).

There are many ways we can remedy this. For my analysis, I will use random undersampling and random oversampling. Random undersampling deletes examples from the majority (negative) class until the target distribution is even. Random oversampling duplicates examples from the minority (positive) class until the target distribution is even. We will try both and see which is more effective.
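Both techniques are available in the imbalanced-learn library; a sketch, resampling the training split only so the test set stays untouched:

```python
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Undersample the majority (non-default) class...
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X_train_scaled, y_train)

# ...or oversample the minority (default) class.
ros = RandomOverSampler(random_state=42)
X_over, y_over = ros.fit_resample(X_train_scaled, y_train)

# Both resampled targets are now 50/50.
print(pd.Series(y_under).value_counts())
print(pd.Series(y_over).value_counts())
```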

Let's see how these models perform:

Random Undersampling

RF recall score: 0.79

Random Oversampling

RF recall score: 0.79

Both recall scores are 0.79, down nearly 0.16 from our baseline model. Interesting…

Analyze Results

Sometimes the best model is the simplest. The model with minimal manipulation yielded the highest recall score of 0.95. After feature selection and hyperparameter tuning, recall decreased to 0.79.

Let's check for overfitting. Overfitting means the model is strong at predicting the data on which it was trained but weak at generalizing to unseen data. Let's test the model on never-before-seen data points and see how it performs.
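A sketch of the check, assuming a further split of the training data into train and validation sets (names and split sizes are illustrative):

```python
from sklearn.model_selection import train_test_split

# Carve a validation set out of the training data; the test set
# from earlier remains completely unseen until now.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_scaled, y_train, test_size=0.25, stratify=y_train, random_state=42
)
rf.fit(X_tr, y_tr)
print("Validation accuracy:", round(rf.score(X_val, y_val), 4))
print("Test accuracy:      ", round(rf.score(X_test_scaled, y_test), 4))
```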

RF validation score: 0.8208, test score: 0.8217

The validation score is similar to the test score, so we know the model performs similarly on completely unseen data. Therefore, we can conclude there is no overfitting going on.

ROC Curve

The area under the ROC curve tells us how well the model separates the different classes in the dataset. The curve plots the true positive rate against the false positive rate.
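scikit-learn can compute and plot this from predicted probabilities; a sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# Probability of the positive (default) class for each test example.
proba = rf.predict_proba(X_test_scaled)[:, 1]
print("Area under ROC:", round(roc_auc_score(y_test, proba), 4))

# Plot the true positive rate against the false positive rate.
RocCurveDisplay.from_predictions(y_test, proba)
plt.show()
```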

Area under ROC = 0.7747

We're calculating the area under the blue curve (the pink dotted line marks a random classifier). This area is a number between 0 and 1: 0 means the model predicted all of the data incorrectly, and 1 means the model predicted all of the data correctly. Our model is pretty good at 0.7747.

At the end of the day, the ability to predict 95% (recall score) of potential defaults would save a lot of money on credit card charge-offs. Obviously, real-world application is more nuanced, but this modeling process is a step in the right direction.

For the full code on this project and others, please visit my GitHub repository.


Contact me

email: md.ghsd@gmail.com
