
Lending Club is an American peer-to-peer lending company based in San Francisco, founded by Renaud Laplanche in 2006. Its platform connects individual borrowers and investors: borrowers gain access to credit at rates lower than those offered by traditional banks, while investors can lend their savings to a community that meets certain conditions (risk associated with the borrower, a social or solidarity dimension, etc.).

For a bank or any institution that extends credit, it is crucial to know its clients and their ability to pay the money back. Traditional credit-scoring techniques already exist and take a client's characteristics into account; however, they cannot adapt easily to a new environment, unlike machine learning techniques, which can detect previously unseen relationships and behaviors.

The model's predictions can be expressed either as a solvency probability or as a binary classification (Payer/Delinquent).
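The link between the two output forms is a decision threshold applied to the predicted probability. A minimal sketch, with hypothetical probabilities:

```python
import numpy as np

# Hypothetical predicted default probabilities for five borrowers.
proba = np.array([0.10, 0.35, 0.50, 0.80, 0.95])

# Binary classification: label as Delinquent (1) when the predicted
# probability crosses a chosen decision threshold, Payer (0) otherwise.
threshold = 0.5
labels = (proba >= threshold).astype(int)
print(labels)  # [0 0 1 1 1]
```

Moving the threshold trades precision against recall, which is why both metrics are reported later.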

To tackle this problem, we will use a partial dataset from the Lending Club competition on Kaggle.

Dataset exploration

Name | Type | Missing values | Meaning
Id | Numerical | 0 | Id of the client
Loan_amnt | Numerical | 0 | Loan amount
Emp_length | Categorical | 0 | Employment length
Home_ownership | Categorical | 0 | Home ownership
Zip_code | Categorical | 0 | Zip code in the US
Annual_inc | Numerical | 0 | Annual income
Dti | Numerical | 0 | Debt-to-income ratio
Delinq_2yrs | Numerical | 0 | Number of delinquencies in the past two years
Inq_last_6mths | Numerical | 0 | Number of inquiries in the last 6 months
Open_acc | Numerical | 0 | Number of open credit lines
Pub_rec | Numerical | 0 | Number of derogatory public records
Revol_bal | Numerical | 0 | Revolving credit balance
Revol_util | Numerical | 93 | Revolving line utilization rate
Total_acc | Numerical | 0 | Total number of credit lines currently in the borrower's credit file
Collections_12_mths_ex_med | Numerical | 13 | Number of collections in the last 12 months, excluding medical collections
Acc_now_delinq | Numerical | 0 | Number of accounts on which the borrower is now delinquent
Tot_coll_amt | Numerical | 22361 | Total collection amount owed
Tot_cur_bal | Numerical | 22361 | Total current balance of all accounts
Total_rev_hi_lim | Numerical | 22361 | Total revolving credit limit
Purpose | Categorical | 0 | Purpose of the loan
Loan_status | Categorical | 0 | Payer/Delinquent

Loan_status is our variable of interest on which we will make the prediction. It has two categories:

  • Payers: people who have paid off their credit in full.
    Number of individuals in this category: 60432, i.e. 61% of all individuals.
  • Delinquents: people who have not paid back their credit.
    Number of individuals in this category: 38766, i.e. 39% of all individuals.

We are therefore not dealing with a strongly imbalanced problem, in which the minority class would represent something like 5% of the data.
In such a case, two techniques could be used:

  • Oversampling: duplicate individuals of the minority class in order to balance the two classes.
  • Undersampling: reduce the size of the majority class in order to balance it with the minority class.

Both techniques aim to account for the imbalance in the algorithm's cost function, so that one class is not discriminated against in favor of the other.
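A minimal sketch of both resampling strategies, on a toy frame standing in for the loan data (hypothetical values):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 7 payers (1) vs 3 delinquents (0).
df = pd.DataFrame({"loan_amnt": range(10),
                   "loan_status": [1] * 7 + [0] * 3})

majority = df[df["loan_status"] == 1]
minority = df[df["loan_status"] == 0]

# Oversampling: draw minority rows with replacement until both classes match.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced_over = pd.concat([majority, minority_up])

# Undersampling: shrink the majority class down to the minority size.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
balanced_under = pd.concat([majority_down, minority])

print(balanced_over["loan_status"].value_counts().to_dict())   # 7 vs 7
print(balanced_under["loan_status"].value_counts().to_dict())  # 3 vs 3
```

Resampling is applied to the training set only; the test set keeps its original distribution.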

Feature engineering

Feature engineering must be applied to the training and test sets simultaneously to guarantee identical treatment, so we combine both datasets before applying any transformation.
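One common way to do this is to concatenate the two frames with an origin flag and split them back afterwards. A sketch with hypothetical frames (the `dataset` flag column is an assumption, not part of the original data):

```python
import pandas as pd

# Hypothetical train/test frames; the flag remembers each row's origin.
train = pd.DataFrame({"loan_amnt": [1000, 2000], "dataset": "train"})
test = pd.DataFrame({"loan_amnt": [1500], "dataset": "test"})

combined = pd.concat([train, test], ignore_index=True)
# ... apply all feature-engineering steps on `combined` here ...

# Split back and drop the helper column.
train_out = combined[combined["dataset"] == "train"].drop(columns="dataset")
test_out = combined[combined["dataset"] == "test"].drop(columns="dataset")
print(len(train_out), len(test_out))  # 2 1
```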

Filling of the missing values

Here we have to distinguish between the two types of variables (categorical and numerical), since they are treated differently.

For categorical variables, we create an 'Unknown' category, assigned to individuals for whom the value of the variable is missing.

For continuous variables, we replace NaN by the median and create a binary control variable called 'missing', equal to 1 if the row contains a missing value and 0 otherwise.
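Both treatments can be sketched on a small frame (hypothetical values, using two columns from the dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "home_ownership": ["RENT", None, "OWN"],  # categorical
    "revol_util": [0.4, np.nan, 0.9],         # numerical
})

# Categorical: missing values become their own 'Unknown' category.
df["home_ownership"] = df["home_ownership"].fillna("Unknown")

# Numerical: flag rows with any missing value, then impute the median.
num_cols = ["revol_util"]
df["missing"] = df[num_cols].isna().any(axis=1).astype(int)
for col in num_cols:
    df[col] = df[col].fillna(df[col].median())

print(df)  # row 1: home_ownership='Unknown', revol_util=0.65, missing=1
```

Keeping the 'missing' flag lets the model learn whether missingness itself is predictive.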

One hot encoding

We create a binary variable for each category of each categorical variable: the goal is to replace strings (which most algorithms cannot interpret) with numerical variables. For example, 'purpose_medical' equals 1 if the value of the categorical variable 'purpose' is 'medical' for that individual.
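In pandas this is a one-liner with `get_dummies`; a minimal sketch on the 'purpose' column:

```python
import pandas as pd

df = pd.DataFrame({"purpose": ["medical", "car", "medical"]})

# One binary column per category of each categorical variable.
dummies = pd.get_dummies(df, columns=["purpose"], dtype=int)

print(list(dummies.columns))                # ['purpose_car', 'purpose_medical']
print(dummies["purpose_medical"].tolist())  # [1, 0, 1]
```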

Machine Learning models

We will use three classification algorithms: logistic regression, random forest, and the XGBoost classifier, an optimized implementation of gradient boosting. These algorithms are detailed in What is Supervised Learning?

Throughout the process, we must carefully choose the set of predictors and select the best hyper-parameters, using the techniques discussed in How to select my features and my hyper-parameters?

Results

First, we evaluate the improvement obtained from hyper-parameter selection.
Using randomized search, we obtain the following parameters and scores for each algorithm:

Algorithm | Hyper-parameters | Precision | Recall | F1
Logistic Regression | penalty = "elasticnet", l1_ratio = 1, alpha = 0.01 | 42% | 55% | 47%
Random Forest | n_estimators = 300, max_features = "sqrt", criterion = "entropy" | 59% | 34% | 43%
XGBoost Classifier | n_estimators = 500, learning_rate = 0.03, colsample_bylevel = 0.7, max_depth = 7 | 59% | 40% | 47%
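The randomized search can be sketched as follows, here for the random-forest case only, on synthetic data and with an assumed search space (the actual grids and data are not shown in the article):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the loan data (hypothetical shape).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(100, 500),
        "max_features": ["sqrt", "log2"],
        "criterion": ["gini", "entropy"],
    },
    n_iter=5,       # small budget for the sketch
    scoring="f1",   # the metric reported in the table
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Unlike grid search, randomized search samples a fixed number of parameter combinations, which keeps the cost manageable when the grid is large.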

The ROC and precision/recall curves obtained are:

ROC Curve
Precision/Recall Curve

We find that the XGBoost classifier offers the best performance on all three metrics: the area under the ROC curve, the area under the precision/recall curve, and the F1 score.
As for logistic regression, its area under the ROC curve is close to 0.5, meaning it performs no better than a random classifier.

In a second step, we keep the previously chosen parameters for XGBoost and also apply feature selection in order to reduce dimensionality. The method evaluated is recursive feature elimination, implemented in Scikit-Learn as the 'RFE' class. This method requires numerical inputs, and since it refits the model as many times as there are variables to eliminate, one-hot encoding would multiply the number of columns and make it expensive; it is therefore recommended to use the LabelEncoder, also available in Scikit-Learn, which transforms categorical variables into ordinal ones.
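A minimal sketch of this step, assuming a small mixed-type frame and a decision tree as the RFE estimator for speed (the article uses XGBoost):

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical mixed-type frame with one categorical column.
df = pd.DataFrame({
    "loan_amnt": [1000, 2000, 1500, 3000] * 25,
    "purpose": ["medical", "car", "car", "house"] * 25,
    "annual_inc": [40, 80, 60, 120] * 25,
})
y = [0, 1, 0, 1] * 25

# LabelEncoder turns the categorical column into ordinal codes,
# avoiding the column blow-up that one-hot encoding would cause
# inside RFE's repeated fits.
df["purpose"] = LabelEncoder().fit_transform(df["purpose"])

# Recursively eliminate features until 2 remain; ranking_ gives
# rank 1 to the kept features, higher ranks to eliminated ones.
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=2)
rfe.fit(df, y)
print(dict(zip(df.columns, rfe.ranking_)))
```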

The ranking obtained is:

Let's take a closer look at the most important predictor variables. Before the platform lends money to a customer, it is important that it has the following information:

  • Annual income (annual_inc)
  • Loan amount (loan_amnt)
  • Employment length (emp_length)
  • Debt-to-income ratio (dti)
  • Revolving credit balance (revol_bal)
  • Number of credit inquiries in the last 6 months (inq_last_6mths)
  • Number of delinquencies over the last two years (delinq_2yrs)

We do indeed find this information in the ranking.

The results obtained by keeping the three quarters most important predictor variables are summarized below:

Algorithm | Precision | Recall | F1-score | ROC AUC
XGBoost Classifier | 59% | 40% | 47% | 69%

Interpretation

We are therefore able to detect 40% of the positive cases (people who did not pay back), and 59% of our positive predictions are correct (meaning that 41% of positive predictions are false alarms: solvent people labeled as insolvent). The F1 score, which combines precision and recall, is close to 50%.
The area under the ROC curve is about 70%, exceeding the 50% of a random classifier by 20 points.
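As a sanity check on the reported numbers, the F1 score is the harmonic mean of precision and recall:

```python
precision, recall = 0.59, 0.40  # values from the results table

# F1 = harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.48, consistent with the ~47% reported
```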
