Using Survival Analysis to Model Time to Default

Contributed by: Lore Dirick, Bart Baesens, Gerda Claeskens

This article first appeared in Data Science Briefings, the DataMiningApps newsletter.


With the recent financial crisis hitting the US and Europe, credit risk modeling has become more important than ever. The introduction of compliance guidelines such as the Basel and Solvency accords has had a huge impact on the strategies of financial institutions. Extensive research in the credit risk domain was encouraged mainly by the Basel II accord, which allows large banks to assess risk with their own models in order to determine the minimum amount of capital they need to hold as a buffer against unexpected losses. The bank’s objective here is to build a model that can assess a borrower’s risk as accurately as possible. This is where survival analysis can play an important role.

With the probability of default (PD) as a key credit risk parameter, credit risk models typically aim at distinguishing “good” customers from “bad” customers. This is historically done through classification techniques such as decision trees, neural networks and logistic regression. A disadvantage of these techniques, however, is that they do not take the timing of default into account. With survival analysis, we can predict not only whether but also when customers are likely to default. This can lead to more accurate credit risk calculations, and it brings several other advantages that are discussed in what follows.

Firstly, survival analysis can deal with censoring. In the credit risk context, censoring occurs when a loan is still under repayment at the moment of data gathering. At that stage, it is clear that default has not occurred yet. However, as the loan has not reached the end of its term, we cannot draw final conclusions on whether it will eventually default or, in classification terms, whether we are dealing with a “good” or a “bad” customer. Traditional classification techniques need a final label for every data point, so such a customer cannot be used as an input to the model. Because survival analysis focuses on the time aspect of default, it can handle censoring, and the valuable information “customer X with characteristics Y has been repaying for at least Z months” can be taken into account. The advantage of not being forced to leave out these censored cases is straightforward: as more information can be included when building the model, survival analysis models can make more accurate predictions than standard classification techniques.
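To make the role of censoring concrete, here is a minimal sketch of the Kaplan-Meier estimator, one standard non-parametric way of estimating a survival curve from censored data. It is an illustration only, not the method of any particular bank or paper; the function name and the toy loan data are hypothetical.

```python
from collections import Counter

def kaplan_meier(durations, defaulted):
    """Return [(month, S(month))] at every month with at least one default.

    durations[i] is the number of months loan i was observed;
    defaulted[i] is True if observation ended in default and
    False if the loan was censored (still repaying, no default seen).
    """
    defaults = Counter(t for t, d in zip(durations, defaulted) if d)
    exits = Counter(durations)      # defaults and censored loans leaving at t
    at_risk = len(durations)        # loans still under observation
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if defaults[t]:
            # Censored loans count in the denominator up to the month they leave.
            survival *= 1.0 - defaults[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical toy portfolio: three defaults, three censored loans.
months = [3, 5, 5, 8, 12, 12]
default = [True, False, True, False, True, False]
curve = kaplan_meier(months, default)
for month, s in curve:
    print(f"month {month:2d}: estimated survival {s:.3f}")
```

The censored loans never trigger a drop in the curve, but they stay in the at-risk count for as long as they were observed, which is exactly the “has been repaying for at least Z months” information described above.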

A natural consequence of using survival analysis, and at the same time a second advantage, is that this method does not produce a single PD estimate per customer, but a range of PD estimates that depend on time. A PD_XYZ can then be interpreted as “what is the probability that customer X with characteristics Y will have defaulted by month Z”. With these time-dependent PDs capturing the risk of default, it becomes much easier to predict the expected revenue of a specific loan.
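As a rough illustration of how time-dependent PDs feed into a revenue estimate, the sketch below turns a sequence of monthly default hazards into a cumulative PD per month and then into an expected sum of installments. The hazards, the installment amount and both function names are hypothetical, and the calculation assumes no discounting, no recoveries after default and no early repayment.

```python
def pd_curve(monthly_hazards):
    """Cumulative probability of default by the end of each month.

    monthly_hazards[z] is the probability of defaulting in month z + 1,
    given that the loan has not defaulted before that month.
    """
    survival = 1.0
    pds = []
    for hazard in monthly_hazards:
        survival *= 1.0 - hazard
        pds.append(1.0 - survival)
    return pds

def expected_payments(installment, monthly_hazards):
    """Expected total of installments received.

    Assumption: the installment of a month is received only if the loan
    has not defaulted by the end of that month, and nothing is recovered
    afterwards.
    """
    return sum(installment * (1.0 - pd) for pd in pd_curve(monthly_hazards))

hazards = [0.01, 0.02, 0.02]        # hypothetical hazards for a 3-month loan
pds = pd_curve(hazards)
revenue = expected_payments(100.0, hazards)
print([round(p, 4) for p in pds])   # cumulative PD by month 1, 2, 3
print(round(revenue, 2))            # expected total of the three installments
```

A single-number PD would only give the last entry of the curve; the month-by-month values are what allow each installment to be weighted by the chance that it is actually paid.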

Classically, PD models include application information of loan applicants as fixed predictor variables. This application information may include home ownership, age, marital status, and so on. A pitfall of including these as fixed variables is that most of them can change over time: borrowers can buy or sell a house, get married or divorced, and so on, while a loan is under repayment. These specific events will most likely change the probability of default. A third advantage of survival analysis is that this method allows us to include such time-dependent covariates. Other covariates of this nature are the current account balance (which changes each month!) and covariates that are not related to a specific customer but to the state of the economy. These so-called “macro-economic factors”, such as the unemployment level, the interest rate and the house price index, play a very important role in credit risk modeling.
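In practice, time-dependent covariates are commonly supplied to a survival model in a “counting process” layout, where each loan is split into intervals over which every covariate is constant. The sketch below shows that data preparation step for a single covariate; the helper name, the column names and the example loan are hypothetical, and fitting an actual model on the resulting rows is left out.

```python
def to_intervals(loan_id, change_months, values, end_month, defaulted):
    """Split one loan's history into (start, stop] rows.

    change_months[i] is the month from which values[i] applies;
    change_months[0] is assumed to be 0 and all entries are assumed
    to be smaller than end_month. Only the last row can carry the
    default event; earlier rows end because the covariate changed.
    """
    rows = []
    last = len(change_months) - 1
    for i, (start, value) in enumerate(zip(change_months, values)):
        stop = change_months[i + 1] if i < last else end_month
        rows.append({
            "loan": loan_id,
            "start": start,
            "stop": stop,
            "homeowner": value,
            "default": defaulted and i == last,
        })
    return rows

# Hypothetical borrower X buys a house in month 7 and defaults in month 18.
rows = to_intervals("X", [0, 7], [False, True], end_month=18, defaulted=True)
for row in rows:
    print(row)
```

Macro-economic factors fit the same layout: every loan would get a new row each time the unemployment level or interest rate is updated, with the event flag set only on the row in which default occurs.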

As a final advantage we refer to the possibility of modeling several event types in one model. As survival analysis models the time to a specific event, not only default can be taken into consideration, but also the early repayment of a loan. Through more complex mixtures of survival models, one can compute the probabilities of default and early repayment for a specific customer all in one model.
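The sketch below illustrates the idea of handling two event types at once. Note that it uses discrete-time cause-specific hazards, a simpler competing-risks formulation than the mixture models mentioned above, and that the hazard values and the function name are hypothetical.

```python
def cumulative_incidence(default_hazards, prepay_hazards):
    """Probability of default and of early repayment by each month.

    The two hazard lists give, per month, the probability that a loan
    still open at the start of the month ends in that month through
    default or through early repayment respectively.
    """
    still_open = 1.0
    cif_default, cif_prepay = 0.0, 0.0
    out_default, out_prepay = [], []
    for h_default, h_prepay in zip(default_hazards, prepay_hazards):
        # An event in this month requires the loan to be open until now.
        cif_default += still_open * h_default
        cif_prepay += still_open * h_prepay
        still_open *= 1.0 - h_default - h_prepay
        out_default.append(cif_default)
        out_prepay.append(cif_prepay)
    return out_default, out_prepay

cif_d, cif_p = cumulative_incidence([0.01, 0.02], [0.03, 0.03])
print([round(p, 4) for p in cif_d])  # probability of default by month 1, 2
print([round(p, 4) for p in cif_p])  # probability of early repayment by month 1, 2
```

Because a prepaid loan can no longer default, the default probability by month 2 is lower here than it would be if early repayment were ignored.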

Despite the many advantages, banks are still hesitant to use survival analysis. There are still some challenges, such as the generally high amount of censoring caused by low default rates, and the fact that these models are harder to interpret than standard classification techniques. Even so, the potential of survival analysis in the credit risk area is enormous. It will only be a matter of time before survival analysis conquers the financial world!