Understanding Predictions in Survival Analysis

What is Survival Analysis?

The objective in survival analysis (also referred to as reliability analysis in engineering) is to establish a connection between covariates and the time of an event. The name survival analysis originates from clinical research, where predicting the time to death, i.e., survival, is often the main objective. Survival analysis is a type of regression problem (one wants to predict a continuous value), but with a twist: part of the training data can only be partially observed, that is, it is censored.

As an example, consider a clinical study that investigates cardiovascular disease and was carried out over a one-year period, as in the figure below.


Patient A was lost to follow-up after three months with no recorded cardiovascular event, patient B experienced an event four and a half months after enrollment, patient C withdrew from the study three and a half months after enrollment, patient D also experienced an event during the study, and patient E did not experience any event before the study ended. Consequently, the exact time of a cardiovascular event could only be recorded for patients B and D; their records are uncensored. For the remaining patients, it is unknown whether they experienced an event after their last follow-up. The only valid information available for patients A, C, and E is that they were event-free up to their last follow-up. Therefore, their records are censored.

Formally, each patient record consists of a set of covariates \(x \in \mathbb{R}^d\) and either the time \(t>0\) when an event occurred or the time \(c>0\) of censoring. Since censoring and experiencing an event are mutually exclusive, it is common to define an event indicator \(\delta \in \{0, 1\}\) and the observable survival time \(y>0\). The observable time \(y\) of a right-censored sample is defined as

\[\begin{split}y = \min(t, c) = \begin{cases} t & \text{if } \delta = 1 , \\ c & \text{if } \delta = 0 . \end{cases}\end{split}\]

Consequently, survival analysis requires models that take this characteristic of the data into account.
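The definition of \(y\) and \(\delta\) can be sketched in a few lines of NumPy. This is only an illustration: the times follow the example study, except for patient D's event time, which is not given in the text and is made up here. A time that was never observed is encoded as infinity, which is a convention chosen for this sketch.

```python
import numpy as np

# Times in months for patients A to E. np.inf marks an event that was
# never observed. Patient D's event time (7.0) is hypothetical.
t = np.array([np.inf, 4.5, np.inf, 7.0, np.inf])  # time of event
c = np.array([3.0, 12.0, 3.5, 12.0, 12.0])        # time of censoring

delta = t <= c        # event indicator: True if the event was observed
y = np.minimum(t, c)  # observable survival time y = min(t, c)

print(delta)
print(y)
```

scikit-survival estimators expect the pair \((\delta, y)\) as a structured array, which `sksurv.util.Surv.from_arrays(event=delta, time=y)` can build.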

Basic Quantities

Rather than predicting a single point in time of an event, the prediction step in survival analysis often focuses on predicting a function: either the survival or the hazard function. The survival function \(S(t)\) returns the probability of survival beyond time \(t\), i.e., \(S(t) = P(T > t)\), whereas the hazard function \(h(t)\) denotes an approximate probability (it is not bounded from above) that an event occurs in the small time interval \([t, t + \Delta t)\), under the condition that an individual has remained event-free up to time \(t\):

\[h(t) = \lim_{\Delta t \rightarrow 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} \geq 0 .\]

Alternative names for the hazard function are conditional failure rate, conditional mortality rate, or instantaneous failure rate. In contrast to the survival function, which describes the absence of an event, the hazard function provides information about the occurrence of an event. Finally, the cumulative hazard function \(H(t)\) is the integral of the hazard function over the interval \([0, t]\):

\[H(t) = \int_0^t h(u)\,du .\]
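For a continuous event time \(T\) with density \(f(t)\), the three quantities determine one another. The following standard identities are not stated above, but follow from the definitions together with \(S(0) = 1\):

\[h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt} \log S(t) \quad\Longrightarrow\quad H(t) = -\log S(t) , \qquad S(t) = \exp(-H(t)) .\]

As a worked example, a constant hazard \(h(t) = \lambda\) yields \(H(t) = \lambda t\) and \(S(t) = \exp(-\lambda t)\), i.e., exponentially distributed event times.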


The survival function \(S(t)\) and cumulative hazard function \(H(t)\) can be estimated from a set of observed time points \(\{(y_1, \delta_1), \ldots, (y_n, \delta_n)\}\) using sksurv.nonparametric.kaplan_meier_estimator() and sksurv.nonparametric.nelson_aalen_estimator(), respectively.

The above estimators are often too simple, because they do not take additional factors into account that could affect survival, e.g., age or a pre-existing condition. Cox’s proportional hazards model (sksurv.linear_model.CoxPHSurvivalAnalysis) provides a way to estimate the survival and cumulative hazard functions in the presence of additional covariates. This is possible because it assumes that a baseline hazard function exists and that covariates change the “risk” (hazard) only proportionally. In other words, it assumes that the ratio of the “risk” of experiencing an event between any two patients remains constant over time. After fitting Cox’s proportional hazards model, \(S(t)\) and \(H(t)\) can be estimated using sksurv.linear_model.CoxPHSurvivalAnalysis.predict_survival_function() and sksurv.linear_model.CoxPHSurvivalAnalysis.predict_cumulative_hazard_function(), respectively.


For other survival models that do not rely on the proportional hazards assumption, it is often impossible to estimate the survival or cumulative hazard function. Their predictions are risk scores of arbitrary scale, where a higher score indicates a higher risk and thus an earlier predicted event. If samples are ordered according to their predicted risk score, one obtains the sequence of events, as predicted by the model. This is the return value of the predict() method of all survival models in scikit-survival.

Consequently, predictions are often evaluated by a measure of rank correlation between predicted risk scores and observed time points in the test data. In particular, Harrell’s concordance index (sksurv.metrics.concordance_index_censored()) computes the ratio of correctly ordered (concordant) pairs to comparable pairs and is the default performance metric when calling a survival model’s score() method.
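To make the ratio of concordant to comparable pairs concrete, here is a simplified pure-Python sketch. It is not the library implementation: it ignores tied observed times, which sksurv.metrics.concordance_index_censored() handles, and the function name and data are hypothetical.

```python
def harrell_c_index(event, time, risk):
    """Simplified concordance index: concordant pairs / comparable pairs.

    A pair (i, j) is comparable if sample i had an event and a shorter
    observed time than sample j. It is concordant if sample i also has
    the higher predicted risk. Ties in risk count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # the earlier sample of a pair must be uncensored
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical observations and predicted risk scores.
event = [False, True, False, True, False]
time = [3.0, 4.5, 3.5, 7.0, 12.0]
risk = [0.2, 0.9, 0.1, 0.4, 0.5]

c_index = harrell_c_index(event, time, risk)
print(f"{c_index:.3f}")
```

Here three pairs are comparable and two of them are ordered correctly by the risk scores. A value of 1 would correspond to a perfect ordering and 0.5 to a random one.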