THE term ‘Big Data’ is used to refer to a set of techniques allowing the processing and storage of extremely large data sets. The most important of these techniques were set out in a series of papers published by Jeffrey Dean, Sanjay Ghemawat and others, which are known as the MapReduce, GFS and Bigtable papers.
MapReduce showed a generic technique for decomposing very complex operations into a series of simpler phases known as map, shuffle and reduce. As long as certain sequencing barriers are preserved, these operations can be performed in parallel with great efficiency over subsets of the data on a cluster of machines, allowing the cluster collectively to process data sets far larger than any single machine (or ‘node’) could handle. Similarly, GFS and Bigtable showed distributed methods of storing, retrieving and processing very large data sets in ways that were fault-tolerant and horizontally scalable.
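The data flow through these phases can be sketched in a few lines. The following is a minimal single-process illustration of the pattern (word counting, the classic example); in a real cluster each phase would run in parallel across many nodes, with the shuffle acting as the sequencing barrier between map and reduce.

```python
from collections import defaultdict

def map_phase(documents):
    # map: emit (key, value) pairs from each input record
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # shuffle: group all emitted values by key -- the barrier that
    # must complete before any reduce can start
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: combine each key's values into a final result
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts == {"big": 3, "data": 1, "ideas": 1, "clusters": 1}
```

Because each map call touches only one record, and each reduce call touches only one key's group, both phases parallelise naturally over subsets of the data.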
There are a wide variety of problems that become tractable only once this type of large-scale analysis becomes possible. For the purposes of SME and midcap credit analysis such as that which OakNorth does, however, the challenge is subtly different. Here, the primary challenge is not the size of the data sets themselves (after all, in this area data is extremely hard to come by) but rather the multitude of sources, each with a different format. The typical techniques of MapReduce et al (which process large numbers of items efficiently) are therefore less useful than the ability to upload and store many input sources and map the data within them into a common data model, allowing like-for-like comparison and further processing.
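As an illustration of the common-data-model idea, consider two sources that report the same underlying fact in different shapes. The sketch below is hypothetical (the field names and source formats are invented, not OakNorth's actual schema): a small per-source adapter normalises each format into one shared schema so records can then be compared like-for-like.

```python
# Hypothetical source A reports revenue in thousands of GBP under its
# own field names; source B reports an absolute figure with a currency tag.

def from_source_a(record):
    # Adapter for source A: {"company": ..., "rev_gbp_thousands": ...}
    return {"name": record["company"],
            "revenue_gbp": record["rev_gbp_thousands"] * 1_000}

def from_source_b(record):
    # Adapter for source B: {"Name": ..., "AnnualRevenue": ..., "Currency": ...}
    assert record["Currency"] == "GBP"  # a real adapter would convert FX
    return {"name": record["Name"],
            "revenue_gbp": record["AnnualRevenue"]}

raw_a = {"company": "Hotel Alpha", "rev_gbp_thousands": 1_200}
raw_b = {"Name": "Hotel Beta", "AnnualRevenue": 950_000, "Currency": "GBP"}

# After mapping, both records share one schema and can be processed together.
normalised = [from_source_a(raw_a), from_source_b(raw_b)]
```

The work here is in writing one small adapter per source, after which everything downstream operates on a single model.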
Another contributing factor has been regulatory concern about the understandability of models. In the aftermath of the 2008 financial crisis, regulators have been reluctant to approve any model that is not easily understandable: opaque models depend on assumptions made in good times that may not hold under extreme market stress or dislocation, which can lead to broader nonlinear effects and increased systemic risk.
For example, in pricing residential mortgage bonds before 2008, a common simplifying assumption was that loans underlying a given security were independent, and therefore a default in one loan would not affect another. It is clear, however, that there is not just default correlation but default correlation skew: over the majority of the credit cycle, defaults are uncorrelated, but if the owner of one house on a street defaults on their mortgage in the downcycle, this brings house prices down and therefore makes it far more likely that several other mortgages in that area will default. In this scenario, both defaults and default correlation go up. Because the loans are not independent, once defaults start, the value of the collateral underlying the bond will tend to deteriorate rapidly.
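The effect of that broken independence assumption can be shown with a toy simulation (this is an illustration only, not any real pricing model; all parameter values are invented). Two portfolios have the same average default rate, but in one the defaults are driven by a shared local house-price factor, as in the street example above.

```python
import random

random.seed(42)
N_LOANS, N_TRIALS = 100, 10_000
P_DEFAULT = 0.02          # through-the-cycle default probability per loan
P_DOWNTURN = 0.1          # chance the local market is in a downcycle
P_DEFAULT_DOWN = 0.11     # per-loan default probability in a downcycle
P_DEFAULT_NORMAL = 0.01   # per-loan default probability otherwise
# The mixture keeps the unconditional rate equal:
# 0.1 * 0.11 + 0.9 * 0.01 = 0.02

def trial(correlated):
    # One trial: count defaults among N_LOANS loans
    if correlated:
        # all loans share one market state, so defaults cluster
        down = random.random() < P_DOWNTURN
        p = P_DEFAULT_DOWN if down else P_DEFAULT_NORMAL
    else:
        p = P_DEFAULT
    return sum(random.random() < p for _ in range(N_LOANS))

ind = [trial(False) for _ in range(N_TRIALS)]
cor = [trial(True) for _ in range(N_TRIALS)]
worst_ind, worst_cor = max(ind), max(cor)
# Both portfolios average ~2 defaults per trial, but the correlated one
# has far fatter tails: its worst-case loss count is much higher.
```

This is exactly why the independence assumption flattered pre-2008 bond prices: it priced the average correctly while drastically understating the tail.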
Many AI/ML techniques (in particular neural nets) result in models where the relationships between the inputs and the outputs are not intuitive or easily understood a priori by humans. Indeed, they are valued particularly for their ability to model problems where simple, intuitive solutions are not well known or understood. Sometimes, however, conventional approaches also have significant shortcomings. In the case of credit cards, mortgages and other personal loans, a common method of assessing the credit quality of a given portfolio is to use a Markov chain model. First, properties of the borrower of the loan are used to create a credit score or rating, then a state transition matrix is used to model the probability that loans in a given rating category will transition into another category (either a different risk score, delinquency or default) at a given time in the future. The estimation of these state transition probabilities is done using conventional statistical approaches over historical data.
This allows the level of defaults in a portfolio to be predicted from the ratings of its loans using Monte Carlo simulation. While this approach can be very successful where there is historical performance data for a large number of similar loans (e.g. credit cards or auto loans), in SME and midcap lending it may be difficult or impossible to gather data on enough similar loans in any given category to have confidence that the transition matrix is correct and that performance is well understood. There are very significant differences between businesses: even two that at first glance appear similar (say, two similarly rated hotels in the same area) may in fact be very different (one catering to budget business travellers, the other a destination boutique hotel for tourists), and would therefore respond to different economic stresses.
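The Markov chain machinery itself is straightforward, which is part of its regulatory appeal. The sketch below uses a deliberately tiny, invented three-state chain (Performing, Delinquent, Default) rather than a realistic rating scale, and the transition probabilities are illustrative only.

```python
import random

STATES = ["Performing", "Delinquent", "Default"]
# Each row gives the probability of moving to each state in one period.
TRANSITIONS = {
    "Performing": [0.95, 0.04, 0.01],
    "Delinquent": [0.30, 0.50, 0.20],
    "Default":    [0.00, 0.00, 1.00],   # default is an absorbing state
}

def simulate_portfolio(n_loans, horizon, rng):
    # Monte Carlo: walk each loan through the chain and count defaults
    defaults = 0
    for _ in range(n_loans):
        state = "Performing"
        for _ in range(horizon):
            state = rng.choices(STATES, weights=TRANSITIONS[state])[0]
        if state == "Default":
            defaults += 1
    return defaults

rng = random.Random(0)
# Distribution of 5-period default counts for a 1,000-loan portfolio
runs = [simulate_portfolio(1_000, 5, rng) for _ in range(200)]
expected_rate = sum(runs) / (200 * 1_000)
# Under these invented probabilities, roughly 9% of loans default
# within five periods.
```

The catch, as the paragraph above notes, is estimating the transition matrix: with credit cards there are millions of loan-periods to fit it from; with SME lending there may be only a handful per category.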
Fundamental corporate credit analysis requires the construction of a financial model of a business or business plan, and then the use of this model to assess performance in a variety of stress scenarios. The tasks break down into those concerning the construction of the model (including how to build the model given the input data from a given business and how to compare it with external data points, including comparable businesses) and those around constructing the stress scenarios. Finally, these outputs are typically assessed by an analyst to determine the sensitivity of the borrower’s plan to given stress factors and therefore the serviceability of debt and sustainability of the business should things go wrong. The advantage of this approach is that the model is fairly simple to understand and assumptions are relatively easy to validate using external data. These modelling tasks are, however, necessarily bespoke for each business, and as such a ‘one size fits all’ approach will not yield good results.
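In skeleton form, the stress-testing step looks something like the following. This is a deliberately simplified sketch, not OakNorth's methodology: the financial model is reduced to three lines and all figures are invented.

```python
def annual_cash_flow(revenue, gross_margin, fixed_costs):
    # A toy one-line 'financial model' of the business plan
    return revenue * gross_margin - fixed_costs

def dscr(cash_flow, debt_service):
    # Debt service coverage ratio: above 1 means debt can be serviced
    return cash_flow / debt_service

# Invented base-case assumptions for a hypothetical borrower
base_revenue, margin, fixed_costs, debt_service = 5_000_000, 0.40, 1_200_000, 500_000

results = {}
for stress in (0.0, 0.10, 0.25):   # revenue falls by 0%, 10%, 25%
    cf = annual_cash_flow(base_revenue * (1 - stress), margin, fixed_costs)
    results[stress] = dscr(cf, debt_service)
# results: base-case DSCR 1.6; a 10% revenue fall leaves 1.2;
# a 25% fall drops it to 0.6, i.e. the debt becomes unserviceable.
```

Each assumption in such a model (margin, fixed costs, stress levels) is visible and can be checked against external data points, which is precisely the understandability advantage the paragraph above describes. The catch is that the model itself must be rebuilt for each business.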
As noted above in contrast with credit-card lending, there is nowhere near enough historical data to fit a conventional credit model, and the same scarcity of data would prevent a typical ML approach (such as unsupervised learning) from producing accurate bespoke financial models for business borrowers.
For this reason, we believe that human-machine symbiosis holds the key to unlocking automation in this space. There is not enough data to fit a general model that would accurately assess every corporate credit case in this class, but performing the analysis fully manually requires the analyst to carry out very many tasks, some of which can be automated by applying ML techniques to the data we do have. Human judgment should always influence the outcome and help ensure the understandability of outputs.
By Sean Hunter, CIO at OakNorth