The Client and Setting
This assignment took place in the early days of what is now one of the larger marketplace lenders in India. To get started, the company had employed a manual underwriter to evaluate its potential borrowers and underwrite loans.
As the company started scaling, it realised that underwriting could become a bottleneck. However, it was still early enough that it didn’t have enough data of its own to build a full credit scoring model.
“We don’t have any defaults yet, so we can’t use our data to do credit underwriting. However, we have been collecting a lot of very interesting data from our potential borrowers. And we have a very good manual underwriter.
“What we want is to be able to replicate him, and at scale, as we look to scale our loans business. Moreover, automating underwriting can mean we can issue loans much faster than we are doing now.
“Our sales team is doing an excellent job, and we don’t want underwriting to become a bottleneck in our growth process.”
At the outset, this looked like a rather straightforward assignment. We had all the data that the client had collected from its potential borrowers, along with its manual underwriter’s decision on whether each loan was to be granted or not.
From a traditional standpoint, it was a simple exercise in supervised learning – we had all these parameters and a binary outcome, and we were to build a model based on this.
In practice, however, it was not so straightforward. Exploratory data analysis told us that there were significant correlations between many of the data points that the company was collecting and the ultimate decision on whether to offer a loan or not.
Digging deeper, however, it became clear that many of the data points that correlated with the outcome were also highly correlated among themselves. Any model that treated these variables as independent was bound to produce spurious results.
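The kind of check that surfaces this problem can be sketched in a few lines. The snippet below is only an illustration, not the analysis actually run for the client: the column names, the sample values and the 0.8 threshold are all made up for the example. It computes the Pearson correlation for every pair of explanatory variables and flags the pairs that move together too closely to be treated as independent.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for each pair of predictors with |r| >= threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical applicant data, invented purely for illustration.
applicants = {
    "monthly_income": [20, 35, 50, 65, 80, 95],
    "monthly_spend": [15, 27, 41, 50, 66, 74],
    "age": [41, 23, 35, 52, 29, 38],
}
pairs = highly_correlated_pairs(applicants)
```

In this toy data, income and spend are flagged as a pair while age is not. When many predictors are flagged like this, each one’s individual correlation with the outcome says little about what it contributes on its own.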
Results and Outcomes
This was, unfortunately, one of those exercises where there was no real positive outcome. Sometimes it can happen that there is “no signal in the data”, and this was one of those cases.
The correlations among the explanatory variables meant that any model based on them would necessarily be flawed. Hence, rather than trying to replicate the manual underwriter in code, I decided, in consultation with the client, to put in place a priority system for incoming loan requests.
This way, loans that had a higher chance of being approved would be seen by the underwriter earlier, thus reducing the average time taken to approve a successful loan application.
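A priority system of this kind can be sketched as a simple priority queue. This is a minimal illustration and not the system that was built: the class names, the application IDs and the `approval_score` values are hypothetical, and how such a score would be derived is left open here. The only idea shown is that applications with a higher estimated chance of approval reach the underwriter first, with ties broken by arrival order.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedApplication:
    sort_key: float  # negated score, so the highest score is popped first
    arrival: int     # tie-breaker: earlier arrivals come first
    application_id: str = field(compare=False)


class UnderwritingQueue:
    """Hands applications to the underwriter in descending order of score."""

    def __init__(self):
        self._heap = []
        self._counter = 0

    def add(self, application_id, approval_score):
        # approval_score is an assumed input: any estimate of approval likelihood
        entry = QueuedApplication(-approval_score, self._counter, application_id)
        heapq.heappush(self._heap, entry)
        self._counter += 1

    def next_application(self):
        return heapq.heappop(self._heap).application_id

    def __len__(self):
        return len(self._heap)


# Hypothetical applications and scores, for illustration only.
queue = UnderwritingQueue()
queue.add("A-101", 0.35)
queue.add("A-102", 0.80)
queue.add("A-103", 0.80)
queue.add("A-104", 0.10)
review_order = [queue.next_application() for _ in range(len(queue))]
```

The underwriter still makes every decision; the queue only changes the order in which applications are seen, which is why it can shorten the wait for applications that end up approved without needing a model good enough to decide on its own.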