Lending Club is an American peer-to-peer lending company based in San Francisco. It was founded by Renaud Laplanche in 2006. Its platform lets individuals lend to one another: borrowers get access to credit at rates lower than those offered by traditional banks, and investors can put their savings into loans to a community that meets certain conditions (risk associated with the borrower, social or solidarity dimension, etc.).
For a bank or any institution that provides credit, it is crucial to know its clients and their ability to repay. Traditional credit scoring techniques already exist and take the different characteristics of a client into account. However, they do not adapt easily to a new environment, unlike machine learning techniques, which can detect previously unseen relationships and behaviors.
The model predictions can be expressed in the form of solvency probability or in the form of
To work on this problem, we will use a partial dataset from the Lending Club competition on Kaggle.
| Variable | Type | Missing values | Description |
|---|---|---|---|
| Id | Numerical | 0 | Id of the client |
| Zip_code | Categorical | 0 | Zip code in the US |
| Dti | Numerical | 0 | Debt-to-income ratio |
| Delinq_2yrs | Numerical | 0 | Number of delinquencies in the past two years |
| Inq_last_6mths | Numerical | 0 | Number of inquiries in the last 6 months |
| Open_acc | Numerical | 0 | Number of open transactions in the past 6 months |
| Pub_rec | Numerical | 0 | Number of derogatory public records |
| Revol_bal | Numerical | 0 | Balance of revolving credit |
| Revol_util | Numerical | 93 | Utilization rate of revolving credit lines |
| Total_acc | Numerical | 0 | The total number of lines of credit currently in the borrower’s credit file |
| Collections_12_mths_ex_med | Numerical | 13 | Number of collections in 12 months excluding medical collections |
| Acc_now_delinq | Numerical | 0 | The number of accounts on which the borrower is now late |
| Tot_coll_amt | Numerical | 22361 | Total amount due |
| Tot_cur_bal | Numerical | 22361 | Total balance of the account |
| Purpose | Categorical | 0 | Purpose of the loan |
Loan_status is our variable of interest on which we will make the prediction. It has two categories:
- Payers: people who have paid off all their credit. Number of persons in this category: 60432, which represents 61% of all individuals.
- Delinquents: people who have not paid back their credit. Number of persons in this category: 38766, which represents 39% of all individuals.
We are therefore not dealing with strongly imbalanced data, where the minority class typically represents around 5% of the individuals.
In a strongly imbalanced case, two techniques could be used:
- Oversampling: We will duplicate the individuals of the minority class in order to balance the two classes.
- Undersampling: We will reduce the size of the majority class in order to balance it with the minority class.
Both techniques rebalance the training data, so that the cost function of the algorithm does not favor one class at the expense of the other.
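The two resampling techniques can be sketched as follows. This is a minimal illustration on synthetic data, not the code used in this study: the array names and the 950/50 class split are assumptions chosen for the example, and `sklearn.utils.resample` is only one possible way to draw the samples.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)

# Hypothetical imbalanced data: 950 majority rows (0) and 50 minority rows (1)
X = rng.normal(size=(1000, 3))
y = np.array([0] * 950 + [1] * 50)

X_maj, X_min = X[y == 0], X[y == 1]

# Oversampling: draw minority rows with replacement until both classes match
X_min_over = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_over = np.vstack([X_maj, X_min_over])
y_over = np.array([0] * len(X_maj) + [1] * len(X_min_over))

# Undersampling: keep a random subset of the majority class, without replacement
X_maj_under = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)
X_under = np.vstack([X_maj_under, X_min])
y_under = np.array([0] * len(X_maj_under) + [1] * len(X_min))

print(len(y_over), len(y_under))
```

In practice, resampling is usually applied to the training set only, so that the test set keeps the original class proportions.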
Feature engineering must be applied identically to the training and test sets. We therefore combine both datasets before applying any transformation.
Filling in the missing values
Here we have to distinguish between the two types of variables (categorical and numerical), since each is treated differently.
For categorical variables, we create an ‘Unknown’ category, assigned to individuals whose value for that variable is missing.
For continuous variables, we replace NaN with the median and create a binary indicator variable called ‘missing’, which is 1 if the row contains a missing value and 0 otherwise.
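A minimal pandas sketch of this treatment is given below. The toy DataFrame and its values are hypothetical, and the `missing` flag is computed here over the numerical columns only, which is one possible reading of the rule above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in both kinds of columns
df = pd.DataFrame({
    "purpose": ["car", None, "house", "car"],
    "revol_util": [0.42, np.nan, 0.10, 0.77],
    "tot_cur_bal": [1000.0, 2500.0, np.nan, 400.0],
})

cat_cols = ["purpose"]
num_cols = ["revol_util", "tot_cur_bal"]

# Flag rows with at least one missing numerical value, before imputing
df["missing"] = df[num_cols].isna().any(axis=1).astype(int)

# Categorical variables: missing values become their own category
df[cat_cols] = df[cat_cols].fillna("Unknown")

# Numerical variables: missing values are replaced by the column median
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

print(df)
```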
One-hot encoding
We create a binary variable for each category of each categorical variable.
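As an illustration, one-hot encoding can be done with `pandas.get_dummies`. The sample values below are hypothetical, and other tools (for example Scikit-Learn's `OneHotEncoder`) would work equally well.

```python
import pandas as pd

# Hypothetical sample: one categorical and one numerical variable
df = pd.DataFrame({
    "purpose": ["car", "Unknown", "house", "car"],
    "dti": [10.5, 3.2, 18.0, 7.1],
})

# Each category of 'purpose' becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=["purpose"], dtype=int)

print(encoded.columns.tolist())
```

Each row has exactly one of the new `purpose_*` columns set to 1, and the numerical column is left unchanged.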
Machine Learning models
We will use three classification algorithms: logistic regression, random forest, and the XGBoost classifier, which is an improved version of gradient boosting. These algorithms are detailed in “What is Supervised Learning?”
During the process, we need to choose the set of predictors carefully and select the best hyper-parameters. We use the techniques discussed in “How to select my features and my hyper-parameters?”
First, we evaluate the improvement due to hyper-parameter selection.
We use Randomized Search and obtain the following parameters and scores for each algorithm:
| Algorithm | Hyper-parameters |
|---|---|
| Logistic Regression | Penalty = “elastic net”, L1_ratio = 1, Alpha = 0.01 |
| Random Forest | N_estimators = 300, Max_features = “sqrt” |
| XGBoost Classifier | N_estimators = 500, Learning_rate = 0.03, Colsample_bylevel = 0.7, Max_depth = 7 |
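A randomized search of this kind can be sketched with Scikit-Learn's `RandomizedSearchCV`. The example below is only an assumption about how such a search might look: it uses synthetic data, a random forest, and a small hypothetical search space, not the grids that produced the values above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the loan data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical search space, kept small so the example runs quickly
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [3, 5, None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of random combinations tried
    scoring="roc_auc",   # metric used to rank the combinations
    cv=3,                # 3-fold cross-validation
    random_state=0,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Unlike an exhaustive grid search, only `n_iter` combinations are evaluated, which keeps the cost under control when the search space is large.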
The ROC and precision/recall curves obtained are:
We find that the XGBoost algorithm offers the best performance on all three scores: the area under the ROC curve, the area under the precision/recall curve, and the F1 score.
As for logistic regression, its area under the ROC curve is close to 0.5, which means that it performs no better than a random classifier.
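The three scores compared above can be computed with `sklearn.metrics`. The sketch below uses synthetic data and a logistic regression purely for illustration. Average precision is used here as a common summary of the precision/recall curve, which is an assumption about how that area was measured.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 60/40 class split
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.6, 0.4], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
pred = model.predict(X_test)               # hard labels at the 0.5 threshold

scores = {
    "roc_auc": roc_auc_score(y_test, proba),
    "pr_auc": average_precision_score(y_test, proba),
    "f1": f1_score(y_test, pred),
}
print(scores)
```

The two area scores are computed from predicted probabilities, while the F1 score depends on the decision threshold applied to them.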
In a second step, we keep the parameters previously chosen for XGBoost and also apply feature selection in order to reduce dimensionality. The method evaluated is recursive feature elimination, implemented in Scikit-Learn as the ‘RFE’ class. This method requires categorical variables to be encoded numerically, and since the model is refitted repeatedly as variables are eliminated, the number of executions grows with the number of variables. It is therefore recommended to use the LabelEncoder, which transforms each categorical variable into a single ordinal variable, rather than one-hot encoding. It is also implemented in Scikit-Learn.
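A minimal sketch of recursive feature elimination is shown below. It makes several assumptions: the data is synthetic, the column names are borrowed from the dataset only for readability, and Scikit-Learn's `GradientBoostingClassifier` is swapped in for XGBoost so that the example has no extra dependency.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import LabelEncoder

rng = np.random.RandomState(0)
n = 200

# Hypothetical data: the target depends mostly on 'dti'
df = pd.DataFrame({
    "dti": rng.uniform(0, 40, n),
    "revol_bal": rng.uniform(0, 50000, n),
    "purpose": rng.choice(["car", "house", "other"], n),
    "noise": rng.normal(size=n),
})
y = (df["dti"] + rng.normal(scale=5, size=n) > 20).astype(int)

# Turn the categorical variable into a single integer-coded column
df["purpose"] = LabelEncoder().fit_transform(df["purpose"])

# Keep 3 of the 4 variables, dropping one variable per iteration
selector = RFE(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    n_features_to_select=3,
    step=1,
)
selector.fit(df, y)

# Rank 1 marks the selected variables; higher ranks were eliminated earlier
ranking = pd.Series(selector.ranking_, index=df.columns).sort_values()
print(ranking)
```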
The ranking obtained is:
Let’s take a closer look at the most important predictor variables. Before lending money to a customer, the platform needs the following information:
- Annual income (annual_inc)
- Amount of credit (loan_amt)
- Length of employment (emp_length)
- Ratio of indebtedness to income (dti)
- Revolving credit balance (revol_bal)
- Number of credit inquiries in the last six months (inq_last_6mths)
- Number of delinquencies from other companies during the last two years (delinq_2yrs)
This information does indeed appear in the ranking.
The results obtained by keeping the top three quarters of the predictor variables, ranked by importance, are summarized below:
We are therefore able to detect 40% of the positive cases (people who have not repaid) in the data, and 59% of the positive predictions are correct (which means that 41% of positive predictions are false alerts: people who are solvent but labeled as insolvent). The F1 score combines precision and recall and is close to 50%.
The area under the ROC curve is 70%, which is 20 percentage points above the 50% of a random classifier.
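As a quick check, the F1 score can be recomputed from the precision and recall reported above, since it is their harmonic mean. The variable names are chosen for this example only.

```python
precision = 0.59  # share of positive predictions that are correct
recall = 0.40     # share of actual positive cases that are detected

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Share of positive predictions that are false alerts
false_alert_rate = 1 - precision

print(round(f1, 3), round(false_alert_rate, 2))  # 0.477 0.41
```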