scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizing the power of scikit-learn, e.g., for pre-processing or doing cross-validation.
The objective in survival analysis (also referred to as time-to-event or reliability analysis) is to establish a connection between covariates and the time of an event. What makes survival analysis differ from traditional machine learning is the fact that parts of the training data can only be partially observed – they are censored.
For instance, in a clinical study, patients are often monitored for a particular time period, and events occurring in this particular period are recorded. If a patient experiences an event, the exact time of the event can be recorded – the patient’s record is uncensored. In contrast, right censored records refer to patients that remained event-free during the study period and it is unknown whether an event has or has not occurred after the study ended. Consequently, survival analysis demands for models that take this unique characteristic of such a dataset into account.
The easiest way to install scikit-survival is to use Anaconda by running:
conda install -c sebp scikit-survival
Alternatively, you can install scikit-survival from source following this guide.
The user guide provides in-depth information on the key concepts of scikit-survival, an overview of available survival models, and hands-on examples.
The reference guide contains a detailed description of the scikit-survival API. It describes which classes and functions are available and what their parameters are.
Saw a typo in the documentation? Want to add new functionalities? The contributing guidelines will guide you through the process of setting up a development environment and submitting your changes to the scikit-survival team.
This is the recommended and easiest to install scikit-survival is to use Anaconda. Alternatively, you can install scikit-survival From Source.
Pre-built binary packages for Linux, MacOS, and Windows are available for Anaconda. If you have Anaconda installed, run:
If you want to build scikit-survival from source, you will need a C/C++ compiler to compile extensions.
Linux
On Linux, you need to install gcc, which in most cases is available via your distribution’s packaging system. Please follow your distribution’s instructions on how to install packages.
MacOS
On MacOS, you need to install clang, which is available from the Command Line Tools package. Open a terminal and execute:
xcode-select --install
Alternatively, you can download it from the Apple Developers page. Log in with your Apple ID, then search and download the Command Line Tools for Xcode package.
Windows
On Windows, the compiler you need depends on the Python version you are using. See this guide to determine which Microsoft Visual C++ compiler to use with a specific Python version.
To install the latest release of scikit-survival from source, run:
pip install scikit-survival
Note
If you have not installed the dependencies previously, this command will first install all dependencies before installing scikit-survival. Therefore, installation might fail if build requirements of some dependencies are not met. In particular, osqp does require CMake to be installed.
To install the latest source from our GitHub repository, you need to have Git installed and simply run:
pip install git+https://github.com/sebp/scikit-survival.git
The current minimum dependencies to run scikit-survival are:
Python 3.7 or later
ecos
joblib
numexpr
numpy 1.16 or later
osqp
pandas 0.25 or later
scikit-learn 0.22 or 0.23
scipy 1.0 or later
C/C++ compiler
The User Guide covers the most important aspects of doing to survival analysis with scikit-survival.
It is assumed that users have a basic understanding of survival analysis. If you are brand-new to survival analysis, consider studying the basics first, e.g. by reading an introductory book, such as
David G. Kleinbaum and Mitchel Klein (2012), Survival Analysis: A Self-Learning Text, Springer.
John P. Klein and Melvin L. Moeschberger (2003), Survival Analysis: Techniques for Censored and Truncated Data, Springer.
Users new to scikit-survival should read Understanding Predictions in Survival Analysis to get familiar with the basic concepts. The interactive guide Introduction to Survival Analysis with scikit-survival gives a brief overview of how to use scikit-survival for survival analysis. Once you are familiar with the basics, it is highly recommended reading the guide Evaluating Survival Models, which discusses common pitfalls when evaluating the predictive performance of survival models. Finally, there are several model-specific guides that discuss details about particular models, with many examples throughout.
The objective in survival analysis — also referred to as reliability analysis in engineering — is to establish a connection between covariates and the time of an event. The name survival analysis originates from clinical research, where predicting the time to death, i.e., survival, is often the main objective. Survival analysis is a type of regression problem (one wants to predict a continuous value), but with a twist. It differs from traditional regression by the fact that parts of the training data can only be partially observed – they are censored.
As an example, consider a clinical study, which investigates cardiovascular disease and has been carried out over a 1 year period as in the figure below.
Patient A was lost to follow-up after three months with no recorded cardiovascular event, patient B experienced an event four and a half months after enrollment, patient C withdrew from the study three and a half months after enrollment, and patient E did not experience any event before the study ended. Consequently, the exact time of a cardiovascular event could only be recorded for patients B and D; their records are uncensored. For the remaining patients it is unknown whether they did or did not experience an event after termination of the study. The only valid information that is available for patients A, C, and E is that they were event-free up to their last follow-up. Therefore, their records are censored.
Formally, each patient record consists of a set of covariates \(x \in \mathbb{R}^d\) , and the time \(t>0\) when an event occurred or the time \(c>0\) of censoring. Since censoring and experiencing and event are mutually exclusive, it is common to define an event indicator \(\delta \in \{0;1\}\) and the observable survival time \(y>0\). The observable time \(y\) of a right censored sample is defined as
Consequently, survival analysis demands for models that take this unique characteristic of such a dataset into account.
Rather than focusing on predicting a single point in time of an event, the prediction step in survival analysis often focuses on predicting a function: either the survival or hazard function. The survival function \(S(t)\) returns the probability of survival beyond time \(t\), i.e., \(S(t) = P(T > t)\), whereas the hazard function \(h(t)\) denotes an approximate probability (it is not bounded from above) that an event occurs in the small time interval \([t; t + \Delta t[\), under the condition that an individual would remain event-free up to time \(t\):
Alternative names for the hazard function are conditional failure rate, conditional mortality rate, or instantaneous failure rate. In contrast to the survival function, which describes the absence of an event, the hazard function provides information about the occurrence of an event. Finally, the cumulative hazard function \(H(t)\) is the integral over the interval \([0; t]\) of the hazard function:
The survival function \(S(t)\) and cumulative hazard function \(H(t)\) can be estimated from a set of observed time points \(\{(y_1, \delta_i), \ldots, (y_n, \delta_n)\}\) using sksurv.nonparametric.kaplan_meier_estimator() and sksurv.nonparametric.nelson_aalen_estimator(), respectively.
sksurv.nonparametric.kaplan_meier_estimator()
sksurv.nonparametric.nelson_aalen_estimator()
The above estimators are often too simple, because they do not take additional factors into account that could affect survival, e.g. age or a pre-existing condition. Cox’s proportional hazards model (sksurv.linear_model.CoxPHSurvivalAnalysis) provides a way to estimate survival and cumulative hazard function in the presence of additional covariates. This is possible, because it assumes that a baseline hazard function exists and that covariates change the “risk” (hazard) only proportionally. In other words, it assumes that the ratio of the “risk” of experiencing an event of two patients remains constant over time. After fitting Cox’s proportional hazards model, \(S(t)\) and \(H(t)\) can be estimated using sksurv.linear_model.CoxPHSurvivalAnalysis.predict_survival_function() and sksurv.linear_model.CoxPHSurvivalAnalysis.predict_cumulative_hazard_function(), respectively.
sksurv.linear_model.CoxPHSurvivalAnalysis
sksurv.linear_model.CoxPHSurvivalAnalysis.predict_survival_function()
sksurv.linear_model.CoxPHSurvivalAnalysis.predict_cumulative_hazard_function()
Important
For other survival models that do not rely on the proportional hazards assumption, it is often impossible to estimate survival or cumulative hazard function. Their predictions are risk scores of arbitrary scale. If samples are ordered according to their predicted risk score (in ascending order), one obtains the sequence of events, as predicted by the model. This is the return value of the predict() method of all survival models in scikit-survival.
predict()
Consequently, predictions are often evaluated by a measure of rank correlation between predicted risk scores and observed time points in the test data. In particular, Harrell’s concordance index (sksurv.metrics.concordance_index_censored()) computes the ratio of correctly ordered (concordant) pairs to comparable pairs and is the default performance metric when calling a survival model’s score() method.
sksurv.metrics.concordance_index_censored()
score()
What is Survival Analysis?
The Veterans’ Administration Lung Cancer Trial
Survival Data
The Survival Function
Considering other variables by stratification
Multivariate Survival Models
Measuring the Performance of Survival Models
Feature Selection: Which Variable is Most Predictive?
What’s next?
As an example, consider a clinical study, which investigates coronary heart disease and has been carried out over a 1 year period as in the figure below.
Patient A was lost to follow-up after three months with no recorded cardiovascular event, patient B experienced an event four and a half months after enrollment, patient D withdrew from the study two months after enrollment, and patient E did not experience any event before the study ended. Consequently, the exact time of a cardiovascular event could only be recorded for patients B and C; their records are uncensored. For the remaining patients it is unknown whether they did or did not experience an event after termination of the study. The only valid information that is available for patients A, D, and E is that they were event-free up to their last follow-up. Therefore, their records are censored.
Consequently, survival analysis demands for models that take this unique characteristic of such a dataset into account, some of which are showcased below.
The Veterans’ Administration Lung Cancer Trial is a randomized trial of two treatment regimens for lung cancer. The data set (Kalbfleisch J. and Prentice R, (1980) The Statistical Analysis of Failure Time Data. New York: Wiley) consists of 137 patients and 8 variables, which are described below:
Treatment: denotes the type of lung cancer treatment; standard and test drug.
Treatment
standard
test
Celltype: denotes the type of cell involved; squamous, small cell, adeno, large.
Celltype
squamous
small cell
adeno
large
Karnofsky_score: is the Karnofsky score.
Karnofsky_score
Diag: is the time since diagnosis in months.
Diag
Age: is the age in years.
Age
Prior_Therapy: denotes any prior therapy; none or yes.
Prior_Therapy
none
yes
Status: denotes the status of the patient as dead or alive; dead or alive.
Status
dead
alive
Survival_in_days: is the survival time in days since the treatment.
Survival_in_days
Our primary interest is studying whether there are subgroups that differ in survival and whether we can predict survival times.
As described in the section What is Survival Analysis? above, survival times are subject to right-censoring, therefore, we need to consider an individual’s status in addition to survival time. To be fully compatible with scikit-learn, Status and Survival_in_days need to be stored as a structured array with the first field indicating whether the actual survival time was observed or if was censored, and the second field denoting the observed survival time, which corresponds to the time of death (if Status == 'dead', \(\delta = 1\)) or the last time that person was contacted (if Status == 'alive', \(\delta = 0\)).
Status == 'dead'
Status == 'alive'
[1]:
from sksurv.datasets import load_veterans_lung_cancer data_x, data_y = load_veterans_lung_cancer() data_y
array([( True, 72.), ( True, 411.), ( True, 228.), ( True, 126.), ( True, 118.), ( True, 10.), ( True, 82.), ( True, 110.), ( True, 314.), (False, 100.), ( True, 42.), ( True, 8.), ( True, 144.), (False, 25.), ( True, 11.), ( True, 30.), ( True, 384.), ( True, 4.), ( True, 54.), ( True, 13.), (False, 123.), (False, 97.), ( True, 153.), ( True, 59.), ( True, 117.), ( True, 16.), ( True, 151.), ( True, 22.), ( True, 56.), ( True, 21.), ( True, 18.), ( True, 139.), ( True, 20.), ( True, 31.), ( True, 52.), ( True, 287.), ( True, 18.), ( True, 51.), ( True, 122.), ( True, 27.), ( True, 54.), ( True, 7.), ( True, 63.), ( True, 392.), ( True, 10.), ( True, 8.), ( True, 92.), ( True, 35.), ( True, 117.), ( True, 132.), ( True, 12.), ( True, 162.), ( True, 3.), ( True, 95.), ( True, 177.), ( True, 162.), ( True, 216.), ( True, 553.), ( True, 278.), ( True, 12.), ( True, 260.), ( True, 200.), ( True, 156.), (False, 182.), ( True, 143.), ( True, 105.), ( True, 103.), ( True, 250.), ( True, 100.), ( True, 999.), ( True, 112.), (False, 87.), (False, 231.), ( True, 242.), ( True, 991.), ( True, 111.), ( True, 1.), ( True, 587.), ( True, 389.), ( True, 33.), ( True, 25.), ( True, 357.), ( True, 467.), ( True, 201.), ( True, 1.), ( True, 30.), ( True, 44.), ( True, 283.), ( True, 15.), ( True, 25.), (False, 103.), ( True, 21.), ( True, 13.), ( True, 87.), ( True, 2.), ( True, 20.), ( True, 7.), ( True, 24.), ( True, 99.), ( True, 8.), ( True, 99.), ( True, 61.), ( True, 25.), ( True, 95.), ( True, 80.), ( True, 51.), ( True, 29.), ( True, 24.), ( True, 18.), (False, 83.), ( True, 31.), ( True, 51.), ( True, 90.), ( True, 52.), ( True, 73.), ( True, 8.), ( True, 36.), ( True, 48.), ( True, 7.), ( True, 140.), ( True, 186.), ( True, 84.), ( True, 19.), ( True, 45.), ( True, 80.), ( True, 52.), ( True, 164.), ( True, 19.), ( True, 53.), ( True, 15.), ( True, 43.), ( True, 340.), ( True, 133.), ( True, 111.), ( True, 231.), ( True, 378.), ( True, 49.)], dtype=[('Status', '?'), ('Survival_in_days', '<f8')])
We can easily see that only a few survival times are right-censored (Status is False), i.e., most veteran’s died during the study period (Status is True).
False
True
A key quantity in survival analysis is the so-called survival function, which relates time to the probability of surviving beyond a given time point.
Let \(T\) denote a continuous non-negative random variable corresponding to a patient’s survival time. The survival function \(S(t)\) returns the probability of survival beyond time \(t\) and is defined as \[S(t) = P (T > t).\]
Let \(T\) denote a continuous non-negative random variable corresponding to a patient’s survival time. The survival function \(S(t)\) returns the probability of survival beyond time \(t\) and is defined as
If we observed the exact survival time of all subjects, i.e., everyone died before the study ended, the survival function at time \(t\) can simply be estimated by the ratio of patients surviving beyond time \(t\) and the total number of patients:
In the presence of censoring, this estimator cannot be used, because the numerator is not always defined. For instance, consider the following set of patients:
[2]:
import pandas as pd pd.DataFrame.from_records(data_y[[11, 5, 32, 13, 23]], index=range(1, 6))
Using the formula from above, we can compute \(\hat{S}(t=11) = \frac{3}{5}\), but not \(\hat{S}(t=30)\), because we don’t know whether the 4th patient is still alive at \(t = 30\), all we know is that when we last checked at \(t = 25\), the patient was still alive.
An estimator, similar to the one above, that is valid if survival times are right-censored is the Kaplan-Meier estimator.
[3]:
%matplotlib inline import matplotlib.pyplot as plt from sksurv.nonparametric import kaplan_meier_estimator time, survival_prob = kaplan_meier_estimator(data_y["Status"], data_y["Survival_in_days"]) plt.step(time, survival_prob, where="post") plt.ylabel("est. probability of survival $\hat{S}(t)$") plt.xlabel("time $t$")
Text(0.5, 0, 'time $t$')
The estimated curve is a step function, with steps occurring at time points where one or more patients died. From the plot we can see that most patients died in the first 200 days, as indicated by the steep slope of the estimated survival function in the first 200 days.
Patients enrolled in the Veterans’ Administration Lung Cancer Trial were randomized to one of two treatments: standard and a new test drug. Next, let’s have a look at how many patients underwent the standard treatment and how many received the new drug.
[4]:
data_x["Treatment"].value_counts()
standard 69 test 68 Name: Treatment, dtype: int64
Roughly half the patients received the alternative treatment.
The obvious questions to ask is: > Is there any difference in survival between the two treatment groups?
As a first attempt, we can estimate the survival function in both treatment groups separately.
[5]:
for treatment_type in ("standard", "test"): mask_treat = data_x["Treatment"] == treatment_type time_treatment, survival_prob_treatment = kaplan_meier_estimator( data_y["Status"][mask_treat], data_y["Survival_in_days"][mask_treat]) plt.step(time_treatment, survival_prob_treatment, where="post", label="Treatment = %s" % treatment_type) plt.ylabel("est. probability of survival $\hat{S}(t)$") plt.xlabel("time $t$") plt.legend(loc="best")
<matplotlib.legend.Legend at 0x7fd9af20c550>
Unfortunately, the results are inconclusive, because the difference between the two estimated survival functions is too small to confidently argue that the drug affects survival or not.
Sidenote: Visually comparing estimated survival curves in order to assess whether there is a difference in survival between groups is usually not recommended, because it is highly subjective. Statistical tests such as the log-rank test are usually more appropriate.
Next, let’s have a look at the cell type, which has been recorded as well, and repeat the analysis from above.
[6]:
for value in data_x["Celltype"].unique(): mask = data_x["Celltype"] == value time_cell, survival_prob_cell = kaplan_meier_estimator(data_y["Status"][mask], data_y["Survival_in_days"][mask]) plt.step(time_cell, survival_prob_cell, where="post", label="%s (n = %d)" % (value, mask.sum())) plt.ylabel("est. probability of survival $\hat{S}(t)$") plt.xlabel("time $t$") plt.legend(loc="best")
<matplotlib.legend.Legend at 0x7fd9af219e20>
In this case, we observe a pronounced difference between two groups. Patients with squamous or large cells seem to have a better prognosis compared to patients with small or adeno cells.
In the Kaplan-Meier approach used above, we estimated multiple survival curves by dividing the dataset into smaller sub-groups according to a variable. If we want to consider more than 1 or 2 variables, this approach quickly becomes infeasible, because subgroups will get very small. Instead, we can use a linear model, Cox’s proportional hazard’s model, to estimate the impact each variable has on survival.
First however, we need to convert the categorical variables in the data set into numeric values.
[7]:
from sksurv.preprocessing import OneHotEncoder data_x_numeric = OneHotEncoder().fit_transform(data_x) data_x_numeric.head()
Survival models in scikit-survival follow the same rules as estimators in scikit-learn, i.e., they have a fit method, which expects a data matrix and a structured array of survival times and binary event indicators.
fit
[8]:
from sksurv.linear_model import CoxPHSurvivalAnalysis estimator = CoxPHSurvivalAnalysis() estimator.fit(data_x_numeric, data_y)
CoxPHSurvivalAnalysis()
The result is a vector of coefficients, one for each variable, where each value corresponds to the log hazard ratio.
[9]:
pd.Series(estimator.coef_, index=data_x_numeric.columns)
Age_in_years -0.008549 Celltype=large -0.788672 Celltype=smallcell -0.331813 Celltype=squamous -1.188299 Karnofsky_score -0.032622 Months_from_Diagnosis -0.000092 Prior_therapy=yes 0.072327 Treatment=test 0.289936 dtype: float64
Using the fitted model, we can predict a patient-specific survival function, by passing an appropriate data matrix to the estimator’s predict_survival_function method.
predict_survival_function
First, let’s create a set of four synthetic patients.
[10]:
x_new = pd.DataFrame.from_dict({ 1: [65, 0, 0, 1, 60, 1, 0, 1], 2: [65, 0, 0, 1, 60, 1, 0, 0], 3: [65, 0, 1, 0, 60, 1, 0, 0], 4: [65, 0, 1, 0, 60, 1, 0, 1]}, columns=data_x_numeric.columns, orient='index') x_new
Similar to kaplan_meier_estimator, the predict_survival_function method returns a sequence of step functions, which we can plot.
kaplan_meier_estimator
[11]:
import numpy as np pred_surv = estimator.predict_survival_function(x_new) time_points = np.arange(1, 1000) for i, surv_func in enumerate(pred_surv): plt.step(time_points, surv_func(time_points), where="post", label="Sample %d" % (i + 1)) plt.ylabel("est. probability of survival $\hat{S}(t)$") plt.xlabel("time $t$") plt.legend(loc="best")
<matplotlib.legend.Legend at 0x7fd9aec07280>
Once we fit a survival model, we usually want to assess how well a model can actually predict survival. Our test data is usually subject to censoring too, therefore metrics like root mean squared error or correlation are unsuitable. Instead, we use generalization of the area under the receiver operating characteristic (ROC) curve called Harrell’s concordance index or c-index.
The interpretation is identical to the traditional area under the ROC curve metric for binary classification: - a value of 0.5 denotes a random model, - a value of 1.0 denotes a perfect model, - a value of 0.0 denotes a perfectly wrong model.
[12]:
from sksurv.metrics import concordance_index_censored prediction = estimator.predict(data_x_numeric) result = concordance_index_censored(data_y["Status"], data_y["Survival_in_days"], prediction) result[0]
0.7362562471603816
or alternatively
[13]:
estimator.score(data_x_numeric, data_y)
Our model’s c-index indicates that the model clearly performs better than random, but is also far from perfect.
The model above considered all available variables for prediction. Next, we want to investigate which single variable is the best risk predictor. Therefore, we fit a Cox model to each variable individually and record the c-index on the training set.
[14]:
import numpy as np def fit_and_score_features(X, y): n_features = X.shape[1] scores = np.empty(n_features) m = CoxPHSurvivalAnalysis() for j in range(n_features): Xj = X[:, j:j+1] m.fit(Xj, y) scores[j] = m.score(Xj, y) return scores scores = fit_and_score_features(data_x_numeric.values, data_y) pd.Series(scores, index=data_x_numeric.columns).sort_values(ascending=False)
Karnofsky_score 0.709280 Celltype=smallcell 0.572581 Celltype=large 0.561620 Celltype=squamous 0.550545 Treatment=test 0.525386 Age_in_years 0.515107 Months_from_Diagnosis 0.509030 Prior_therapy=yes 0.494434 dtype: float64
Karnofsky_score is the best variable, whereas Months_from_Diagnosis and Prior_therapy='yes' have almost no predictive power on their own.
Months_from_Diagnosis
Prior_therapy='yes'
Next, we want to build a parsimonious model by excluding irrelevant features. We could use the ranking from above, but would need to determine what the optimal cut-off should be. Luckily, scikit-learn has built-in support for performing grid search.
First, we create a pipeline that puts all the parts together.
[15]:
from sklearn.feature_selection import SelectKBest from sklearn.pipeline import Pipeline pipe = Pipeline([('encode', OneHotEncoder()), ('select', SelectKBest(fit_and_score_features, k=3)), ('model', CoxPHSurvivalAnalysis())])
Next, we need to define the range of parameters we want to explore during grid search. Here, we want to optimize the parameter k of the SelectKBest class and allow k to vary from 1 feature to all 8 features.
k
SelectKBest
[16]:
from sklearn.model_selection import GridSearchCV, KFold param_grid = {'select__k': np.arange(1, data_x_numeric.shape[1] + 1)} cv = KFold(n_splits=3, random_state=1, shuffle=True) gcv = GridSearchCV(pipe, param_grid, return_train_score=True, cv=cv) gcv.fit(data_x, data_y) results = pd.DataFrame(gcv.cv_results_).sort_values(by='mean_test_score', ascending=False) results.loc[:, ~results.columns.str.endswith("_time")]
The results show that it is sufficient to select the 3 most predictive features.
[17]:
pipe.set_params(**gcv.best_params_) pipe.fit(data_x, data_y) encoder, transformer, final_estimator = [s[1] for s in pipe.steps] pd.Series(final_estimator.coef_, index=encoder.encoded_columns_[transformer.get_support()])
Celltype=large -0.754714 Celltype=smallcell -0.328059 Celltype=squamous -1.147673 Karnofsky_score -0.031112 Treatment=test 0.257313 dtype: float64
Cox’s proportional hazards model is by far the most popular survival model, because once trained, it is easy to interpret. However, if prediction performance is the main objective, more sophisticated, non-linear or ensemble models might lead to better results. Before you dive deeper into the various survival models, it is highly recommended reading this notebook for getting a better understanding on how to evaluate survival models. The User Guide is a good starting point to learn more about various models implemented in scikit-survival, and the API reference contains a full list of available classes and functions and their parameters. In addition, you can use any unsupervised pre-processing method available with scikit-learn, for instance, you could perform dimensionality reduction using Non-Negative Matrix Factorization (NMF), before training a Cox model.
The most frequently used evaluation metric of survival models is the concordance index (c index, c statistic). It is a measure of rank correlation between predicted risk scores \(\hat{f}\) and observed time points \(y\) that is closely related to Kendall’s τ. It is defined as the ratio of correctly ordered (concordant) pairs to comparable pairs. Two samples \(i\) and \(j\) are comparable if the sample with lower observed time \(y\) experienced an event, i.e., if \(y_j > y_i\) and \(\delta_i = 1\), where \(\delta_i\) is a binary event indicator. A comparable pair \((i, j)\) is concordant if the estimated risk \(\hat{f}\) by a survival model is higher for subjects with lower survival time, i.e., \(\hat{f}_i > \hat{f}_j \land y_j > y_i\), otherwise the pair is discordant. Harrell’s estimator of the c index is implemented in concordance_index_censored.
While Harrell’s concordance index is easy to interpret and compute, it has some shortcomings: 1. it has been shown that it is too optimistic with increasing amount of censoring [1], 2. it is not a useful measure of performance if a specific time range is of primary interest (e.g. predicting death within 2 years).
Since version 0.8, scikit-survival supports an alternative estimator of the concordance index from right-censored survival data, implemented in concordance_index_ipcw, that addresses the first issue.
The second point can be addressed by extending the well known receiver operating characteristic curve (ROC curve) to possibly censored survival times. Given a time point \(t\), we can estimate how well a predictive model can distinguishing subjects who will experience an event by time \(t\) (sensitivity) from those who will not (specificity). The function cumulative_dynamic_auc implements an estimator of the cumulative/dynamic area under the ROC for a given list of time points.
The first part of this notebook will illustrate the first issue with simulated survival data, while the second part will focus on the time-dependent area under the ROC applied to data from a real study.
import numpy as np import matplotlib.pyplot as plt %matplotlib inline import pandas as pd from sklearn.impute import SimpleImputer from sklearn.pipeline import make_pipeline from sklearn.model_selection import train_test_split from sksurv.datasets import load_flchain from sksurv.linear_model import CoxPHSurvivalAnalysis from sksurv.preprocessing import OneHotEncoder from sksurv.util import Surv from sksurv.metrics import (concordance_index_censored, concordance_index_ipcw, cumulative_dynamic_auc) plt.rcParams['figure.figsize'] = [7.2, 4.8]
Harrell’s concordance index is known to be biased upwards if the amount of censoring in the test data is high [1]. Uno et al proposed an alternative estimator of the concordance index that behaves better in such situations. In this section, we are going to apply concordance_index_censored and concordance_index_ipcw to synthetic survival data and compare their results.
We are generating a synthetic biomarker by sampling from a standard normal distribution. For a given hazard ratio, we compute the associated (actual) survival time by drawing from an exponential distribution. The censoring times were generated from a uniform independent distribution \(\textrm{Uniform}(0,\gamma)\), where we choose \(\gamma\) to produce different amounts of censoring.
Since Uno’s estimator is based on inverse probability of censoring weighting, we need to estimate the probability of being censored at a given time point. This probability needs to be non-zero for all observed time points. Therefore, we restrict the test data to all samples with observed time lower than the maximum event time \(\tau\). Usually, one would use the tau argument of concordance_index_ipcw for this, but we apply the selection before to pass identical inputs to concordance_index_censored and concordance_index_ipcw. The estimates of the concordance index are therefore restricted to the interval \([0, \tau]\).
tau
concordance_index_ipcw
concordance_index_censored
import scipy.optimize as opt def generate_marker(n_samples, hazard_ratio, baseline_hazard, rnd): # create synthetic risk score X = rnd.randn(n_samples, 1) # create linear model hazard_ratio = np.array([hazard_ratio]) logits = np.dot(X, np.log(hazard_ratio)) # draw actual survival times from exponential distribution, # refer to Bender et al. (2005), https://doi.org/10.1002/sim.2059 u = rnd.uniform(size=n_samples) time_event = -np.log(u) / (baseline_hazard * np.exp(logits)) # compute the actual concordance in the absence of censoring X = np.squeeze(X) actual = concordance_index_censored(np.ones(n_samples, dtype=bool), time_event, X) return X, time_event, actual[0] def generate_survival_data(n_samples, hazard_ratio, baseline_hazard, percentage_cens, rnd): X, time_event, actual_c = generate_marker(n_samples, hazard_ratio, baseline_hazard, rnd) def get_observed_time(x): rnd_cens = np.random.RandomState(0) # draw censoring times time_censor = rnd_cens.uniform(high=x, size=n_samples) event = time_event < time_censor time = np.where(event, time_event, time_censor) return event, time def censoring_amount(x): event, _ = get_observed_time(x) cens = 1.0 - event.sum() / event.shape[0] return (cens - percentage_cens)**2 # search for upper limit to obtain the desired censoring amount res = opt.minimize_scalar(censoring_amount, method="bounded", bounds=(0, time_event.max())) # compute observed time event, time = get_observed_time(res.x) # upper time limit such that the probability # of being censored is non-zero for `t > tau` tau = time[event].max() y = Surv.from_arrays(event=event, time=time) mask = time < tau X_test = X[mask] y_test = y[mask] return X_test, y_test, y, actual_c def simulation(n_samples, hazard_ratio, n_repeats=100): measures = ("censoring", "Harrel's C", "Uno's C",) data_mean = {} data_std = {} for measure in measures: data_mean[measure] = [] data_std[measure] = [] rnd = np.random.RandomState(seed=987) # iterate over different amount of censoring for cens in (.1, .25, .4, .5, .6, .7): data = {"censoring": [], "Harrel's C": [], "Uno's C": [],} # repeaditly perform simulation for _ in range(n_repeats): # generate data X_test, y_test, y_train, actual_c = generate_survival_data( n_samples, hazard_ratio, baseline_hazard=0.1, percentage_cens=cens, rnd=rnd) # estimate c-index c_harrell = concordance_index_censored(y_test["event"], y_test["time"], X_test) c_uno = concordance_index_ipcw(y_train, y_test, X_test) # save results data["censoring"].append(100. - y_test["event"].sum() * 100. / y_test.shape[0]) data["Harrel's C"].append(actual_c - c_harrell[0]) data["Uno's C"].append(actual_c - c_uno[0]) # aggregate results for key, values in data.items(): data_mean[key].append(np.mean(data[key])) data_std[key].append(np.std(data[key], ddof=1)) data_mean = pd.DataFrame.from_dict(data_mean) data_std = pd.DataFrame.from_dict(data_std) return data_mean, data_std def plot_results(data_mean, data_std, **kwargs): index = pd.Index(data_mean["censoring"].round(3), name="mean percentage censoring") for df in (data_mean, data_std): df.drop("censoring", axis=1, inplace=True) df.index = index ax = data_mean.plot.bar(yerr=data_std, **kwargs) ax.set_ylabel("Actual C - Estimated C") ax.yaxis.grid(True) ax.axhline(0.0, color="gray")
Let us assume a moderate hazard ratio of 2 and generate a small synthetic dataset of 100 samples from which we estimate the concordance index. We repeat this experiment 200 times and plot mean and standard deviation of the difference between the actual (in the absence of censoring) and estimated concordance index.
Since the hazard ratio remains constant and only the amount of censoring changes, we would want an estimator for which the difference between the actual and estimated c to remain approximately constant across simulations.
hazard_ratio = 2.0 ylim = [-0.035, 0.035] mean_1, std_1 = simulation(100, hazard_ratio) plot_results(mean_1, std_1, ylim=ylim)
We can observe that estimates are on average below the actual value, except for the highest amount of censoring, where Harrell’s c begins overestimating the performance (on average).
With such a small dataset, the variance of differences is quite big, so let us increase the amount of data to 1000 and repeat the simulation (this may take some time).
mean_2, std_2 = simulation(1000, hazard_ratio) plot_results(mean_2, std_2, ylim=ylim)
Now we can observe that Harrell’s c begins to overestimate performance starting with approximately 49% censoring while Uno’s c is still underestimating the performance, but is on average very close to the actual performance for large amounts of censoring.
For the final experiment, we double the size of the dataset to 2000 samples and repeat the analysis (this may take several minutes to compute).
mean_3, std_3 = simulation(2000, hazard_ratio) plot_results(mean_3, std_3, ylim=ylim)
The trend we observed in the previous simulation is now even more pronounced. Harrell’s c is becoming more and more overconfident in the performance of the synthetic marker with increasing amount of censoring, while Uno’s c remains stable.
In summary, while the difference between concordance_index_ipcw and concordance_index_censored is negligible for small amounts of censoring, when analyzing survival data with moderate to high amounts of censoring, you might want to consider estimating the performance using concordance_index_ipcw instead of concordance_index_censored.
The area under the receiver operating characteristics curve (ROC curve) is a popular performance measure for binary classification task. In the medical domain, it is often used to determine how well estimated risk scores can separate diseased patients (cases) from healthy patients (controls). Given a predicted risk score \(\hat{f}\), the ROC curve compares the false positive rate (1 - specificity) against the true positive rate (sensitivity) for each possible value of \(\hat{f}\).
When extending the ROC curve to continuous outcomes, in particular survival time, a patient’s disease status is typically not fixed and changes over time: at enrollment a subject is usually healthy, but may be diseased at some later time point. Consequently, sensitivity and specificity become time-dependent measures. Here, we consider cumulative cases and dynamic controls at a given time point \(t\), which gives rise to the time-dependent cumulative/dynamic ROC at time \(t\). Cumulative cases are all individuals that experienced an event prior to or at time \(t\) (\(t_i \leq t\)), whereas dynamic controls are those with \(t_i>t\). By computing the area under the cumulative/dynamic ROC at time \(t\), we can determine how well a model can distinguish subjects who fail by a given time (\(t_i \leq t\)) from subjects who fail after this time (\(t_i>t\)). Hence, it is most relevant if one wants to predict the occurrence of an event in a period up to time \(t\) rather than at a specific time point \(t\).
The cumulative_dynamic_auc function implements an estimator of the cumulative/dynamic area under the ROC at a given list of time points. To illustrate its use, we are going to use data from a study that investigated to which extent the serum immunoglobulin free light chain (FLC) assay can be used predict overall survival. The dataset has 7874 subjects and 9 features; the endpoint is death, which occurred for 2169 subjects (27.5%).
First, we are loading the data and split it into train and test set to evaluate how well markers generalize.
x, y = load_flchain() x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Serum creatinine measurements are missing for some patients, therefore we are just going to impute these values with the mean using scikit-learn’s SimpleImputer.
SimpleImputer
num_columns = ['age', 'creatinine', 'kappa', 'lambda'] imputer = SimpleImputer().fit(x_train.loc[:, num_columns]) x_train = imputer.transform(x_train.loc[:, num_columns]) x_test = imputer.transform(x_test.loc[:, num_columns])
Similar to Uno’s estimator of the concordance index described above, we need to be a little bit careful when selecting the test data and time points we want to evaluate the ROC at, due to the estimator’s dependence on inverse probability of censoring weighting. First, we are going to check whether the observed time of the test data lies within the observed time range of the training data.
y_events = y_train[y_train['death']] train_min, train_max = y_events["futime"].min(), y_events["futime"].max() y_events = y_test[y_test['death']] test_min, test_max = y_events["futime"].min(), y_events["futime"].max() assert train_min <= test_min < test_max < train_max, \ "time range or test data is not within time range of training data."
When choosing the time points to evaluate the ROC at, it is important to remember to choose the last time point such that the probability of being censored after the last time point is non-zero. In the simulation study above, we set the upper bound to the maximum event time, here we use a more conservative approach by setting the upper bound to the 80% percentile of observed time points, because the censoring rate is quite large at 72.5%. Note that this approach would be appropriate for choosing tau of concordance_index_ipcw too.
times = np.percentile(y["futime"], np.linspace(5, 81, 15)) print(times)
[ 470.3 1259. 1998. 2464.82428571 2979. 3401. 3787.99857143 4051. 4249. 4410.17285714 4543. 4631. 4695. 4781. 4844. ]
We begin by considering individual real-valued features as risk scores without actually fitting a survival model. Hence, we obtain an estimate of how well age, creatinine, kappa FLC, and lambda FLC are able to distinguish cases from controls at each time point.
def plot_cumulative_dynamic_auc(risk_score, label, color=None): auc, mean_auc = cumulative_dynamic_auc(y_train, y_test, risk_score, times) plt.plot(times, auc, marker="o", color=color, label=label) plt.xlabel("days from enrollment") plt.ylabel("time-dependent AUC") plt.axhline(mean_auc, color=color, linestyle="--") plt.legend() for i, col in enumerate(num_columns): plot_cumulative_dynamic_auc(x_test[:, i], col, color="C{}".format(i)) ret = concordance_index_ipcw(y_train, y_test, x_test[:, i], tau=times[-1])
The plot shows the estimated area under the time-dependent ROC at each time point and the average across all time points as dashed line.
We can see that age is overall the most discriminative feature, followed by \(\kappa\) and \(\lambda\) FLC. That fact that age is the strongest predictor of overall survival in the general population is hardly surprising (we have to die at some point after all). More differences become evident when considering time: the discriminative power of FLC decreases at later time points, while that of age increases. The observation for age again follows common sense. In contrast, FLC seems to be a good predictor of death in the near future, but not so much if it occurs decades later.
Most of the time, we do not want to evaluate the discriminatory power of individual features, but how a predictive model performs, based on many features. To demonstrate this, we will fit a survival model to predict the risk of death from the Veterans’ Administration Lung Cancer Trial.
First, we split the data into 80% for training and 20% for testing and use the stratify option to ensure that we do not end up with test data only containing censored death times.
stratify
from sksurv.datasets import load_veterans_lung_cancer va_x, va_y = load_veterans_lung_cancer() va_x_train, va_x_test, va_y_train, va_y_test = train_test_split( va_x, va_y, test_size=0.2, stratify=va_y["Status"], random_state=0 )
Next, we fit a Cox proportional hazards model to the training data.
cph = make_pipeline(OneHotEncoder(), CoxPHSurvivalAnalysis()) cph.fit(va_x_train, va_y_train)
Pipeline(steps=[('onehotencoder', OneHotEncoder()), ('coxphsurvivalanalysis', CoxPHSurvivalAnalysis())])
Using the test data, we want to assess how well the model can distinguish survivors from deceased in weekly intervals, up to 6 months after enrollment.
va_times = np.arange(7, 183, 7) cph_risk_scores = cph.predict(va_x_test) cph_auc, cph_mean_auc = cumulative_dynamic_auc( va_y_train, va_y_test, cph_risk_scores, va_times ) plt.plot(va_times, cph_auc, marker="o") plt.axhline(cph_mean_auc, linestyle="--") plt.xlabel("days from enrollment") plt.ylabel("time-dependent AUC") plt.grid(True)
The plot shows that the model is doing moderately well on average with an AUC of ~0.72 (dashed line). However, there is a clear difference in performance between the first and second half of the time range. The performance on the test data increases up to 56 days from enrollment, remains high until 98 days and quickly drops thereafter. Thus, we can conclude that the model is most effective in predicting death in the medium-term.
The downside of Cox proportional hazards model is that it can only predict a risk score that is independent of time (due to the built-in proportional hazards assumption). Therefore, a single predicted risk score needs to work well for every time point. In contrast, a Random Survival Forest does not have this restriction. So let’s fit such a model to the training data.
from sksurv.ensemble import RandomSurvivalForest rsf = make_pipeline( OneHotEncoder(), RandomSurvivalForest(n_estimators=100, min_samples_leaf=7, random_state=0) ) rsf.fit(va_x_train, va_y_train)
Pipeline(steps=[('onehotencoder', OneHotEncoder()), ('randomsurvivalforest', RandomSurvivalForest(min_samples_leaf=7, random_state=0))])
For prediction, we do not call predict, which returns a time-independent risk score, but call predict_cumulative_hazard_function, which returns a risk function over time for each patient. We obtain the time-dependent risk scores by evaluating each cumulative hazard function at the time points we are interested in.
rsf_chf_funcs = rsf.predict_cumulative_hazard_function( va_x_test, return_array=False) rsf_risk_scores = np.row_stack([chf(va_times) for chf in rsf_chf_funcs]) rsf_auc, rsf_mean_auc = cumulative_dynamic_auc( va_y_train, va_y_test, rsf_risk_scores, va_times )
Now, we can compare the result with the predictive performance of the Cox proportional hazards model from above.
plt.plot(va_times, cph_auc, "o-", label="CoxPH (mean AUC = {:.3f})".format(cph_mean_auc)) plt.plot(va_times, rsf_auc, "o-", label="RSF (mean AUC = {:.3f})".format(rsf_mean_auc)) plt.xlabel("days from enrollment") plt.ylabel("time-dependent AUC") plt.legend(loc="lower center") plt.grid(True)
Indeed, the Random Survival Forest performs slightly better on average, mostly due to the better performance in the intervals 25–50 days, and 112–147 days. Above 147 days, it actually is doing worse. This shows that the mean AUC is convenient to assess overall performance, but it can hide interesting characteristics that only become visible when looking at the AUC at individual time points.
I hope this notebook helped you to understand some of the pitfalls when estimating the performance of markers and models from right-censored survival data. We illustrated that Harrell’s estimator of the concordance index is biased when the amount of censoring is high, and that Uno’s estimator is more appropriate in this situation. Finally, we demonstrated that the time-dependent area under the ROC is a very useful tool when we want to predict the occurrence of an event in a period up to time \(t\) rather than at a specific time point \(t\).
Cox’s proportional hazard’s model is often an appealing model, because its coefficients can be interpreted in terms of hazard ratio, which often provides valuable insight. However, if we want to estimate the coefficients of many features, the standard Cox model falls apart, because internally it tries to invert a matrix that becomes non-singular due to correlations among features.
This mathematical problem can be avoided by adding a \(\ell_2\) penalty term on the coefficients that shrinks the coefficients to zero. The modified objective has the form
where \(\mathrm{PL}(\beta)\) is the partial likelihood function of the Cox model, \(\beta_1,\ldots,\beta_p\) are the coefficients for \(p\) features, and \(\alpha \geq 0\) is a hyper-parameter that controls the amount of shrinkage. The resulting objective is often referred to as ridge regression. If \(\lambda\) is set to zero, we obtain the standard, unpenalized Cox model.
import pandas as pd import numpy as np import matplotlib.pyplot as plt %matplotlib inline from sksurv.datasets import load_breast_cancer from sksurv.linear_model import CoxPHSurvivalAnalysis, CoxnetSurvivalAnalysis from sksurv.preprocessing import OneHotEncoder from sklearn.model_selection import GridSearchCV, KFold from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler
To demonstrate the use of penalized Cox models we are going to use the breast cancer data, which contains the expression levels of 76 genes, age, estrogen receptor status (er), tumor size and grade for 198 individuals. The objective is to predict the time to distant metastasis.
er
First, we load the data and perform one-hot encoding of categorical variables er and grade.
grade
X, y = load_breast_cancer() Xt = OneHotEncoder().fit_transform(X) Xt.round(2).head()
5 rows × 82 columns
Let us begin by fitting a penalized Cox model to various values of \(\alpha\) using sksurv.linear_model.CoxPHSurvivalAnalysis and recording the coefficients we obtained for each \(\alpha\).
alphas = 10. ** np.linspace(-4, 4, 50) coefficients = {} cph = CoxPHSurvivalAnalysis() for alpha in alphas: cph.set_params(alpha=alpha) cph.fit(Xt, y) key = round(alpha, 5) coefficients[key] = cph.coef_ coefficients = (pd.DataFrame .from_dict(coefficients) .rename_axis(index="feature", columns="alpha") .set_index(Xt.columns))
Now, we can inspect how the coefficients change for varying \(\alpha\).
def plot_coefficients(coefs, n_highlight): _, ax = plt.subplots(figsize=(9, 6)) n_features = coefs.shape[0] alphas = coefs.columns for row in coefs.itertuples(): ax.semilogx(alphas, row[1:], ".-", label=row.Index) alpha_min = alphas.min() top_coefs = coefs.loc[:, alpha_min].map(abs).sort_values().tail(n_highlight) for name in top_coefs.index: coef = coefs.loc[name, alpha_min] plt.text( alpha_min, coef, name + " ", horizontalalignment="right", verticalalignment="center" ) ax.yaxis.set_label_position("right") ax.yaxis.tick_right() ax.grid(True) ax.set_xlabel("alpha") ax.set_ylabel("coefficient")
plot_coefficients(coefficients, n_highlight=5)
We can see that if the penalty has a large weight (to the right), all coefficients are shrunk almost to zero. As the penalty’s weight is decreased, the coefficients’ value increases. We can also observe that the paths for X203391_at and tumor grade quickly separate themselves from the remaining coefficients, which indicates that this particular gene expression level and tumor grade are important predictive factors for time to distant metastasis.
X203391_at
While the \(\ell_2\) (ridge) penalty does solve the mathematical problem of fitting a Cox model, we would still need to measure the expression levels of all 76 genes to make predictions. Ideally, we would like to select a small subset of features that are most predictive and ignore the remaining gene expression levels. This is precisely what the LASSO (Least Absolute Shrinkage and Selection Operator) penalty does. Instead of shrinking coefficients to zero it does a type of continuous subset selection, where a subset of coefficients are set to zero and are effectively excluded. This reduces the number of features that we would need to record for prediction. In mathematical terms, the \(\ell_2\) penalty is replaced by a \(\ell_1\) penalty, which leads to the optimization problem
The main challenge is that we cannot directly control the number of features that get selected, but the value of \(\alpha\) implicitly determines the number of features. Thus, we need a data-driven way to select a suitable \(\alpha\) and obtain a parsimonious model. We can do this by first computing the \(\alpha\) that would ignore all features (coefficients are all zero) and then incrementally decrease its value, let’s say until we reach 1% of the original value. This has been implemented in sksurv.linear_model.CoxnetSurvivalAnalysis by specifying l1_ratio=1.0 to use the LASSO penalty and alpha_min_ratio=0.01 to search for 100 \(\alpha\) values up to 1% of the estimated maximum.
l1_ratio=1.0
alpha_min_ratio=0.01
cox_lasso = CoxnetSurvivalAnalysis(l1_ratio=1.0, alpha_min_ratio=0.01) cox_lasso.fit(Xt, y)
CoxnetSurvivalAnalysis(alpha_min_ratio=0.01, l1_ratio=1.0)
coefficients_lasso = pd.DataFrame( cox_lasso.coef_, index=Xt.columns, columns=np.round(cox_lasso.alphas_, 5) ) plot_coefficients(coefficients_lasso, n_highlight=5)
The figure shows that the LASSO penalty indeed selects a small subset of features for large \(\alpha\) (to the right) with only two features (purple and yellow line) being non-zero. As \(\alpha\) decreases, more and more features become active and are assigned a non-zero coefficient until the entire set of features is used (to the left left). Similar to the plot above for the ridge penalty, the path for X203391_at stands out, indicating its importance in breast cancer. However, the overall most important factor seems to be a positive estrogen receptor status (er).
The LASSO is a great tool to select a subset of discriminative features, but it has two main drawbacks. First, it cannot select more features than number of samples in the training data, which is problematic when dealing with very high-dimensional data. Second, if data contains a group of features that are highly correlated, the LASSO penalty is going to randomly choose one feature from this group. The Elastic Net penalty overcomes these problems by using a weighted combination of the \(\ell_1\) and \(\ell_2\) penalty by solving:
where \(r \in [0; 1[\) is the relative weight of the \(\ell_1\) and \(\ell_2\) penalty. The Elastic Net penalty combines the subset selection property of the LASSO with the regularization strength of the Ridge penalty. This leads to better stability compared to the LASSO penalized model. For a group of highly correlated features, the latter would choose one feature randomly, whereas the Elastic Net penalized model would tend to select all. Usually, it is sufficient to give the \(\ell_2\) penalty only a small weight to improve stability of the LASSO, e.g. by setting \(r = 0.9\).
As for the LASSO, the weight \(\alpha\) implicitly determines the size of the selected subset, and usually has to be estimated in a data-driven manner.
cox_elastic_net = CoxnetSurvivalAnalysis(l1_ratio=0.9, alpha_min_ratio=0.01) cox_elastic_net.fit(Xt, y)
CoxnetSurvivalAnalysis(alpha_min_ratio=0.01, l1_ratio=0.9)
coefficients_elastic_net = pd.DataFrame( cox_elastic_net.coef_, index=Xt.columns, columns=np.round(cox_elastic_net.alphas_, 5) ) plot_coefficients(coefficients_elastic_net, n_highlight=5)
Previously, we focused on the estimated coefficients to get some insight into which features are important for estimating time to distant metastasis. However, for prediction, we need to pick one particular \(\alpha\), and the subset of features it implies. Here, we are going to use cross-validation to determine which subset and \(\alpha\) generalizes best.
Before we can use GridSearchCV, we need to determine the set of \(\alpha\) which we want to evaluate. To do this, we fit a penalized Cox model to the whole data and retrieve the estimated set of alphas. Since, we are only interested in alphas and not the coefficients, we can use only a few iterations for improved speed. Note that we are using StandardScaler to account for scale differences among features and allow direct comparison of coefficients.
import warnings from sklearn.exceptions import ConvergenceWarning from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler coxnet_pipe = make_pipeline( StandardScaler(), CoxnetSurvivalAnalysis(l1_ratio=0.9, alpha_min_ratio=0.01, max_iter=100) ) warnings.simplefilter("ignore", ConvergenceWarning) coxnet_pipe.fit(Xt, y)
Pipeline(steps=[('standardscaler', StandardScaler()), ('coxnetsurvivalanalysis', CoxnetSurvivalAnalysis(alpha_min_ratio=0.01, l1_ratio=0.9, max_iter=100))])
Using the estimated set of alphas, we perform 5 fold cross-validation to estimate the performance – in terms of concordance index – for each \(\alpha\).
Note: this can take a while.
estimated_alphas = coxnet_pipe.named_steps["coxnetsurvivalanalysis"].alphas_ cv = KFold(n_splits=5, shuffle=True, random_state=0) gcv = GridSearchCV( make_pipeline(StandardScaler(), CoxnetSurvivalAnalysis(l1_ratio=0.9)), param_grid={"coxnetsurvivalanalysis__alphas": [[v] for v in estimated_alphas]}, cv=cv, error_score=0.5, n_jobs=4).fit(Xt, y) cv_results = pd.DataFrame(gcv.cv_results_)
We can visualize the results by plotting the mean concordance index and its standard deviation across all folds for each \(\alpha\).
alphas = cv_results.param_coxnetsurvivalanalysis__alphas.map(lambda x: x[0]) mean = cv_results.mean_test_score std = cv_results.std_test_score fig, ax = plt.subplots(figsize=(9, 6)) ax.plot(alphas, mean) ax.fill_between(alphas, mean - std, mean + std, alpha=.15) ax.set_xscale("log") ax.set_ylabel("concordance index") ax.set_xlabel("alpha") ax.axvline(gcv.best_params_["coxnetsurvivalanalysis__alphas"][0], c="C1") ax.axhline(0.5, color="grey", linestyle="--") ax.grid(True)
The figure shows that there is a range for \(\alpha\) to the right where it is too large and sets all coefficients to zero, as indicated by the 0.5 concordance index of a purely random model. On the other extreme, if \(\alpha\) becomes too small, too many features enter the model and the performance approaches that of a random model again. The sweet spot (orange line) is somewhere in the middle. Let’s inspect that model.
best_model = gcv.best_estimator_.named_steps["coxnetsurvivalanalysis"] best_coefs = pd.DataFrame( best_model.coef_, index=Xt.columns, columns=["coefficient"] ) non_zero = np.sum(best_coefs.iloc[:, 0] != 0) print("Number of non-zero coefficients: {}".format(non_zero)) non_zero_coefs = best_coefs.query("coefficient != 0") coef_order = non_zero_coefs.abs().sort_values("coefficient").index _, ax = plt.subplots(figsize=(6, 8)) non_zero_coefs.loc[coef_order].plot.barh(ax=ax, legend=False) ax.set_xlabel("coefficient") ax.grid(True)
Number of non-zero coefficients: 22
The model selected a total of 21 features, and it deemed X204540_at to be the most important one, followed by X203391_at and positive estrogen receptor status:
X204540_at
Having selected a particular \(\alpha\), we can perform prediction, either in terms of risk score using the predict function or in terms of survival or cumulative hazard function. For the latter two, we first need to re-fit the model with fit_baseline_model enabled.
fit_baseline_model
coxnet_pred = make_pipeline( StandardScaler(), CoxnetSurvivalAnalysis(l1_ratio=0.9, fit_baseline_model=True) ) coxnet_pred.set_params(**gcv.best_params_) coxnet_pred.fit(Xt, y)
Pipeline(steps=[('standardscaler', StandardScaler()), ('coxnetsurvivalanalysis', CoxnetSurvivalAnalysis(alphas=[0.03860196504106796], fit_baseline_model=True, l1_ratio=0.9))])
For instance, we can now select a patient and determine how positive or negative estrogen receptor status would affect the survival function.
surv_fns = coxnet_pred.predict_survival_function(Xt) time_points = np.quantile(y["t.tdm"], np.linspace(0, 0.6, 100)) legend_handles = [] legend_labels = [] _, ax = plt.subplots(figsize=(9, 6)) for fn, label in zip(surv_fns, Xt.loc[:, "er=positive"].astype(int)): line, = ax.step(time_points, fn(time_points), where="post", color="C{:d}".format(label), alpha=0.5) if len(legend_handles) <= label: name = "positive" if label == 1 else "negative" legend_labels.append(name) legend_handles.append(line) ax.legend(legend_handles, legend_labels) ax.set_xlabel("time") ax.set_ylabel("Survival probability") ax.grid(True)
We can observe that patients with positive estrogen receptor status tend to have a better prognosis.
This notebook demonstrates how to use Random Survival Forests introduced in scikit-survival 0.11.
As it’s popular counterparts for classification and regression, a Random Survival Forest is an ensemble of tree-based learners. A Random Survival Forest ensures that individual trees are de-correlated by 1) building each tree on a different bootstrap sample of the original training data, and 2) at each node, only evaluate the split criterion for a randomly selected subset of features and thresholds. Predictions are formed by aggregating predictions of individual trees in the ensemble.
To demonstrate Random Survival Forest, we are going to use data from the German Breast Cancer Study Group (GBSG-2) on the treatment of node-positive breast cancer patients. It contains data on 686 women and 8 prognostic factors: 1. age, 2. estrogen receptor (estrec), 3. whether or not a hormonal therapy was administered (horTh), 4. menopausal status (menostat), 5. number of positive lymph nodes (pnodes), 6. progesterone receptor (progrec), 7. tumor size (tsize, 8. tumor grade (tgrade).
estrec
horTh
menostat
pnodes
progrec
tsize
tgrade
The goal is to predict recurrence-free survival time.
import pandas as pd import matplotlib.pyplot as plt import numpy as np %matplotlib inline from sklearn.model_selection import train_test_split from sklearn.preprocessing import OrdinalEncoder from sksurv.datasets import load_gbsg2 from sksurv.preprocessing import OneHotEncoder from sksurv.ensemble import RandomSurvivalForest
First, we need to load the data and transform it into numeric values.
X, y = load_gbsg2() grade_str = X.loc[:, "tgrade"].astype(object).values[:, np.newaxis] grade_num = OrdinalEncoder(categories=[["I", "II", "III"]]).fit_transform(grade_str) X_no_grade = X.drop("tgrade", axis=1) Xt = OneHotEncoder().fit_transform(X_no_grade) Xt = np.column_stack((Xt.values, grade_num)) feature_names = X_no_grade.columns.tolist() + ["tgrade"]
Next, the data is split into 75% for training and 25% for testing, so we can determine how well our model generalizes.
random_state = 20 X_train, X_test, y_train, y_test = train_test_split( Xt, y, test_size=0.25, random_state=random_state)
Several split criterion have been proposed in the past, but the most widespread one is based on the log-rank test, which you probably know from comparing survival curves among two or more groups. Using the training data, we fit a Random Survival Forest comprising 1000 trees.
rsf = RandomSurvivalForest(n_estimators=1000, min_samples_split=10, min_samples_leaf=15, max_features="sqrt", n_jobs=-1, random_state=random_state) rsf.fit(X_train, y_train)
RandomSurvivalForest(max_features='sqrt', min_samples_leaf=15, min_samples_split=10, n_estimators=1000, n_jobs=-1, random_state=20)
We can check how well the model performs by evaluating it on the test data.
rsf.score(X_test, y_test)
0.6759696016771488
This gives a concordance index of 0.68, which is a good a value and matches the results reported in the Random Survival Forests paper.
For prediction, a sample is dropped down each tree in the forest until it reaches a terminal node. Data in each terminal is used to non-parametrically estimate the survival and cumulative hazard function using the Kaplan-Meier and Nelson-Aalen estimator, respectively. In addition, a risk score can be computed that represents the expected number of events for one particular terminal node. The ensemble prediction is simply the average across all trees in the forest.
Let’s first select a couple of patients from the test data according to the number of positive lymph nodes and age.
a = np.empty(X_test.shape[0], dtype=[("age", float), ("pnodes", float)]) a["age"] = X_test[:, 0] a["pnodes"] = X_test[:, 4] sort_idx = np.argsort(a, order=["pnodes", "age"]) X_test_sel = pd.DataFrame( X_test[np.concatenate((sort_idx[:3], sort_idx[-3:]))], columns=feature_names) X_test_sel
The predicted risk scores indicate that risk for the last three patients is quite a bit higher than that of the first three patients.
pd.Series(rsf.predict(X_test_sel))
0 91.477609 1 102.897552 2 75.883786 3 170.502092 4 171.210066 5 148.691835 dtype: float64
We can have a more detailed insight by considering the predicted survival function. It shows that the biggest difference occurs roughly within the first 750 days.
surv = rsf.predict_survival_function(X_test_sel, return_array=True) for i, s in enumerate(surv): plt.step(rsf.event_times_, s, where="post", label=str(i)) plt.ylabel("Survival probability") plt.xlabel("Time in days") plt.legend() plt.grid(True)
Alternatively, we can also plot the predicted cumulative hazard function.
surv = rsf.predict_cumulative_hazard_function(X_test_sel, return_array=True) for i, s in enumerate(surv): plt.step(rsf.event_times_, s, where="post", label=str(i)) plt.ylabel("Cumulative hazard") plt.xlabel("Time in days") plt.legend() plt.grid(True)
The implementation is based on scikit-learn’s Random Forest implementation and inherits many features, such as building trees in parallel. What’s currently missing is feature importances via the feature_importance_ attribute. This is due to the way scikit-learn’s implementation computes importances. It relies on a measure of impurity for each child node, and defines importance as the amount of decrease in impurity due to a split. For traditional regression, impurity would be measured by the variance, but for survival analysis there is no per-node impurity measure due to censoring. Instead, one could use the magnitude of the log-rank test statistic as an importance measure, but scikit-learn’s implementation doesn’t seem to allow this.
feature_importance_
Fortunately, this is not a big concern though, as scikit-learn’s definition of feature importance is non-standard and differs from what Leo Breiman proposed in the original Random Forest paper. Instead, we can use permutation to estimate feature importance, which is preferred over scikit-learn’s definition. This is implemented in the ELI5 library, which is fully compatible with scikit-survival.
import eli5 from eli5.sklearn import PermutationImportance perm = PermutationImportance(rsf, n_iter=15, random_state=random_state) perm.fit(X_test, y_test) eli5.show_weights(perm, feature_names=feature_names)
The result shows that the number of positive lymph nodes (pnodes) is by far the most important feature. If its relationship to survival time is removed (by random shuffling), the concordance index on the test data drops on average by 0.0676 points. Again, this agrees with the results from the original Random Survival Forests paper.
Gradient Boosting does not refer to one particular model, but a versatile framework to optimize many loss functions. It follows the strength in numbers principle by combining the predictions of multiple base learners to obtain a powerful overall model. The base learners are often very simple models that are only slightly better than random guessing, which is why they are also referred to as weak learners. The predictions are combined in an additive manner, where the addition of each base model improves (or “boosts”) the overall model. Therefore, the overall model \(f\) is an additive model of the form:
where \(M > 0\) denotes the number of base learners, and \(\beta_m \in \mathbb{R}\) is a weighting term. The function \(g\) refers to a base learner and is parameterized by the vector \({\theta}\). Individual base learners differ in the configuration of their parameters \({\theta}\), which is indicated by a subscript \(m\).
A gradient boosted model is similar to a Random Survival Forest, in the sense that it relies on multiple base learners to produce an overall prediction, but differs in how those are combined. While a Random Survival Forest fits a set of Survival Trees independently and then averages their predictions, a gradient boosted model is constructed sequentially in a greedy stagewise fashion.
Depending on the loss function to be minimized and base learner used, different models arise. sksurv.ensemble.GradientBoostingSurvivalAnalysis implements gradient boosting with regression tree base learner, and sksurv.ensemble.ComponentwiseGradientBoostingSurvivalAnalysis uses component-wise least squares as base learner. The former is very versatile and can account for complicated non-linear relationships between features and time to survival. When using component-wise least squares as base learner, the final model will be a linear model, but only a small subset of features will be selected, similar to the LASSO penalized Cox model.
The loss function can be specified via the loss argument loss; the default loss function is the partial likelihood loss of Cox’s proportional hazards model (coxph). Therefore, the objective is to maximize the log partial likelihood function, but replacing the traditional linear model \(\mathbf{x}^\top \beta\) with the additive model \(f(\mathbf{x})\):
loss
coxph
import numpy as np import matplotlib.pyplot as plt import pandas as pd %matplotlib inline from sklearn.model_selection import train_test_split from sksurv.datasets import load_breast_cancer from sksurv.ensemble import ComponentwiseGradientBoostingSurvivalAnalysis from sksurv.ensemble import GradientBoostingSurvivalAnalysis from sksurv.preprocessing import OneHotEncoder
To demonstrate its use we are going to use the breast cancer data, which contains the expression levels of 76 genes, age, estrogen receptor status (er), tumor size and grade for 198 individuals. The objective is to predict the time to distant metastasis.
X, y = load_breast_cancer() Xt = OneHotEncoder().fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(Xt, y, test_size=0.25, random_state=0)
Next, we are using gradient boosting on Cox’s partial likelihood with regression trees base learners, which we restrict to using only a single split (so-called stumps).
est_cph_tree = GradientBoostingSurvivalAnalysis( n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0 ) est_cph_tree.fit(X_train, y_train) cindex = est_cph_tree.score(X_test, y_test) print(round(cindex, 3))
0.756
This model achieves a concordance index of 0.756 on the test data. Let’s see how the test performance changes with the ensemble size (n_estimators).
n_estimators
scores_cph_tree = {} est_cph_tree = GradientBoostingSurvivalAnalysis( learning_rate=1.0, max_depth=1, random_state=0 ) for i in range(1, 31): n_estimators = i * 5 est_cph_tree.set_params(n_estimators=n_estimators) est_cph_tree.fit(X_train, y_train) scores_cph_tree[n_estimators] = est_cph_tree.score(X_test, y_test)
x, y = zip(*scores_cph_tree.items()) plt.plot(x, y) plt.xlabel("n_estimator") plt.ylabel("concordance index") plt.grid(True)
We can see that the performance quickly improves, but also that the performance starts to decrease if the ensemble becomes too big.
Let’s repeat the analysis using component-wise least squares base learners.
scores_cph_ls = {} est_cph_ls = ComponentwiseGradientBoostingSurvivalAnalysis( learning_rate=1.0, random_state=0 ) for i in range(1, 31): n_estimators = i * 10 est_cph_ls.set_params(n_estimators=n_estimators) est_cph_ls.fit(X_train, y_train) scores_cph_ls[n_estimators] = est_cph_ls.score(X_test, y_test)
x, y = zip(*scores_cph_ls.items()) plt.plot(x, y) plt.xlabel("n_estimator") plt.ylabel("concordance index") plt.grid(True)
The performance increase is much slower here and its maximum performance seems to be below that of the ensemble of tree-based learners. This is not surprising, because with component-wise least squares base learners the overall ensemble is a linear model, whereas with tree-based learners it will be a non-linear model.
The coefficients of the model can be retrieved as follows:
coef = pd.Series(est_cph_ls.coef_, ["Intercept"] + Xt.columns.tolist()) print("Number of non-zero coefficients:", (coef != 0).sum()) coef_nz = coef[coef != 0] coef_order = coef_nz.abs().sort_values(ascending=False).index coef_nz.loc[coef_order]
Number of non-zero coefficients: 9
er=positive -0.837549 X207118_s_at 0.239161 grade=unkown -0.234126 size 0.214384 X204540_at 0.094421 X204014_at -0.091377 X216103_at -0.086147 X221916_at -0.081565 grade=intermediate 0.065916 dtype: float64
Despite using hundreds of iterations, the resulting model is very parsimonious and easy to interpret.
The Accelerated Failure Time (AFT) model is an alternative to Cox’s proportional hazards model. The latter assumes that features only influence the hazard function via a constant multiplicative factor. In contrast, features in an AFT model can accelerate or decelerate the time to an event by a constant factor. The figure below depicts the predicted hazard functions of a proportional hazards model in blue and that of an AFT model in orange.
We can see that the hazard remains constant for the proportional hazards model and varies for the AFT model.
The objective function in an AFT model can be expressed as a weighted least squares problem with respect to the logarithm of the survival time:
The weight \(\omega_i\) associated with the \(i\)-th sample is the inverse probability of being censored after time \(y_i\):
where \(\hat{G}(\cdot)\) is an estimator of the censoring survivor function.
Such a model can be fit with sksurv.ensemble.GradientBoostingSurvivalAnalysis or sksurv.ensemble.ComponentwiseGradientBoostingSurvivalAnalysis by specifying the loss="ipcwls" argument.
loss="ipcwls"
est_aft_ls = ComponentwiseGradientBoostingSurvivalAnalysis( loss="ipcwls", n_estimators=300, learning_rate=1.0, random_state=0 ).fit(X_train, y_train) cindex = est_aft_ls.score(X_test, y_test) print(round(cindex, 3))
0.722
The most important parameter in gradient boosting is the number of base learner to use (n_estimators argument). A higher number will lead to a more complex model. However, this can easily lead to overfitting on the training data. The easiest way would be to just use less base estimators, but there are three alternatives to combat overfitting:
Use a learning_rate less than 1 to restrict the influence of individual base learners, similar to the Ridge penalty.
learning_rate
Use a non-zero dropout_rate, which forces base learners to also account for some of the previously fitted base learners to be missing.
dropout_rate
Use subsample less than 1 such that each iteration only a portion of the training data is used. This is also known as stochastic gradient boosting.
subsample
n_estimators = [i * 5 for i in range(1, 21)] estimators = { "no regularization": GradientBoostingSurvivalAnalysis( learning_rate=1.0, max_depth=1, random_state=0 ), "learning rate": GradientBoostingSurvivalAnalysis( learning_rate=0.1, max_depth=1, random_state=0 ), "dropout": GradientBoostingSurvivalAnalysis( learning_rate=1.0, dropout_rate=0.1, max_depth=1, random_state=0 ), "subsample": GradientBoostingSurvivalAnalysis( learning_rate=1.0, subsample=0.5, max_depth=1, random_state=0 ), } scores_reg = {k: [] for k in estimators.keys()} for n in n_estimators: for name, est in estimators.items(): est.set_params(n_estimators=n) est.fit(X_train, y_train) cindex = est.score(X_test, y_test) scores_reg[name].append(cindex) scores_reg = pd.DataFrame(scores_reg, index=n_estimators)
ax = scores_reg.plot(xlabel="n_estimators", ylabel="concordance index") ax.grid(True)
The plot reveals that using dropout or a learning rate are most effective in avoiding overfitting. Moreover, the learning rate and ensemble size are strongly connected, choosing smaller a learning rate suggests increasing n_estimators. Therefore, it is recommended to use a relatively small learning rate and select the number of estimators via early stopping. Note that we can also apply multiple types of regularization, such as regularization by learning rate and subsampling. Since not all training data is used, this allows using the left-out data to evaluate whether we should continue adding more base learners or stop training.
class EarlyStoppingMonitor: def __init__(self, window_size, max_iter_without_improvement): self.window_size = window_size self.max_iter_without_improvement = max_iter_without_improvement self._best_step = -1 def __call__(self, iteration, estimator, args): # continue training for first self.window_size iterations if iteration < self.window_size: return False # compute average improvement in last self.window_size iterations. # oob_improvement_ is the different in negative log partial likelihood # between the previous and current iteration. start = iteration - self.window_size + 1 end = iteration + 1 improvement = np.mean(estimator.oob_improvement_[start:end]) if improvement > 1e-6: self._best_step = iteration return False # continue fitting # stop fitting if there was no improvement # in last max_iter_without_improvement iterations diff = iteration - self._best_step return diff >= self.max_iter_without_improvement est_early_stopping = GradientBoostingSurvivalAnalysis( n_estimators=1000, learning_rate=0.05, subsample=0.5, max_depth=1, random_state=0 ) monitor = EarlyStoppingMonitor(25, 50) est_early_stopping.fit(X_train, y_train, monitor=monitor) print("Fitted base learners:", est_early_stopping.n_estimators_) cindex = est_early_stopping.score(X_test, y_test) print("Performance on test set", round(cindex, 3))
Fitted base learners: 119 Performance on test set 0.703
The monitor looks at the average improvement of the last 25 iterations, and if it was negative for the last 50 iterations, it will abort training. In this case, this occurred after 119 iterations. We can plot the improvement per base learner and the moving average.
improvement = pd.Series( est_early_stopping.oob_improvement_, index=np.arange(1, 1 + len(est_early_stopping.oob_improvement_)) ) ax = improvement.plot(xlabel="iteration", ylabel="oob improvement") ax.axhline(0.0, linestyle="--", color="gray") cutoff = len(improvement) - monitor.max_iter_without_improvement ax.axvline(cutoff, linestyle="--", color="C3") _ = improvement.rolling(monitor.window_size).mean().plot(ax=ax, linestyle=":")
This guide demonstrates how to use the efficient implementation of Survival Support Vector Machines, which is an extension of the standard Support Vector Machine to right-censored time-to-event data. Its main advantage is that it can account for complex, non-linear relationships between features and survival via the so-called kernel trick. A kernel function implicitly maps the input features into high-dimensional feature spaces where survival can be described by a hyperplane. This makes Survival Support Vector Machines extremely versatile and applicable to a wide a range of data. A popular example for such a kernel function is the Radial Basis Function.
Survival analysis in the context of Support Vector Machines can be described in two different ways:
As a ranking problem: the model learns to assign samples with shorter survival times a lower rank by considering all possible pairs of samples in the training data.
As a regression problem: the model learns to directly predict the (log) survival time.
In both cases, the disadvantage is that predictions cannot be easily related to standard quantities in survival analysis, namely survival function and cumulative hazard function. Moreover, they have to retain a copy of the training data to do predictions.
Let’s start by taking a closer look at the Linear Survival Support Vector Machine, which does not allow selecting a specific kernel function, but can be fitted faster than the more generic Kernel Survival Support Vector Machine.
The class sksurv.svm.FastSurvivalSVM is used to train a linear Survival Support Vector Machine. Training data consists of \(n\) triplets \((\mathbf{x}_i, y_i, \delta_i)\), where \(\mathbf{x}_i\) is a \(d\)-dimensional feature vector, \(y_i > 0\) the survival time or time of censoring, and \(\delta_i \in \{0,1\}\) the binary event indicator. Using the training data, the objective is to minimize the following function:
The hyper-parameter \(\alpha > 0\) determines the amount of regularization to apply: a smaller value increases the amount of regularization and a higher value reduces the amount of regularization. The hyper-parameter \(r \in [0; 1]\) determines the trade-off between the ranking objective and the regression objective. If \(r = 1\) it reduces to the ranking objective, and if \(r = 0\) to the regression objective. If the regression objective is used, it is advised to log-transform the observed time first.
The class sksurv.svm.FastSurvivalSVM adheres to interfaces used in scikit-learn and thus it is possible to combine it with auxiliary classes and functions from scikit-learn. In this example, we are going to use the ranking objective (\(r = 1\)) and GridSearchCV to determine the best setting for the hyper-parameter \(\alpha\) on the Veteran’s Lung Cancer data.
First, we have to import the classes we are going to use.
%matplotlib inline import numpy as np import matplotlib.pyplot as plt import pandas import seaborn as sns from sklearn.model_selection import ShuffleSplit, GridSearchCV from sksurv.datasets import load_veterans_lung_cancer from sksurv.column import encode_categorical from sksurv.metrics import concordance_index_censored from sksurv.svm import FastSurvivalSVM sns.set_style("whitegrid")
Next, we load data of the Veteran’s Administration Lung Cancer Trial from disk and convert it to numeric values. The data consists of 137 patients and 6 features. The primary outcome measure is death (Status, Survival_in_days). The original data can be retrieved from http://lib.stat.cmu.edu/datasets/veteran.
data_x, y = load_veterans_lung_cancer() x = encode_categorical(data_x)
data_x is a data frame containing the features, and y is a structured array containing the event indicator \(\delta_i\), as boolean, and the survival/censoring time \(y_i\) for training.
data_x
y
Now, we are essentially ready to start training, but before let’s determine what the amount of censoring for this data is and plot the survival/censoring times.
n_censored = y.shape[0] - y["Status"].sum() print("%.1f%% of records are censored" % (n_censored / y.shape[0] * 100))
6.6% of records are censored
plt.figure(figsize=(9, 6)) val, bins, patches = plt.hist((y["Survival_in_days"][y["Status"]], y["Survival_in_days"][~y["Status"]]), bins=30, stacked=True) _ = plt.legend(patches, ["Time of Death", "Time of Censoring"])
First, we need to create an initial model with default parameters that is subsequently used in the grid search.
estimator = FastSurvivalSVM(max_iter=1000, tol=1e-5, random_state=0)
Next, we define a function for evaluating the performance of models during grid search. We use Harrell’s concordance index.
def score_survival_model(model, X, y): prediction = model.predict(X) result = concordance_index_censored(y['Status'], y['Survival_in_days'], prediction) return result[0]
The last part of the setup specifies the set of parameters we want to try and how many repetitions of training and testing we want to perform for each parameter setting. In the end, the parameters that on average performed best across all test sets (100 in this case) are selected. GridSearchCV can leverage multiple cores by evaluating multiple parameter settings concurrently (4 jobs in this example).
param_grid = {'alpha': 2. ** np.arange(-12, 13, 2)} cv = ShuffleSplit(n_splits=100, test_size=0.5, random_state=0) gcv = GridSearchCV(estimator, param_grid, scoring=score_survival_model, n_jobs=4, refit=False, cv=cv)
Finally, start the hyper-parameter search. This can take a while since a total of 13 * 100 = 1300 fits have to be evaluated.
13 * 100 = 1300
import warnings warnings.filterwarnings("ignore", category=FutureWarning) gcv = gcv.fit(x, y)
Let’s check what is the best average performance across 100 random train/test splits we got and the corresponding hyper-parameters.
round(gcv.best_score_, 3), gcv.best_params_
(0.72, {'alpha': 0.00390625})
Finally, we retrieve all 100 test scores for each parameter setting and visualize their distribution by box plots.
def plot_performance(gcv): n_splits = gcv.cv.n_splits cv_scores = {"alpha": [], "test_score": [], "split": []} order = [] for i, params in enumerate(gcv.cv_results_["params"]): name = "%.5f" % params["alpha"] order.append(name) for j in range(n_splits): vs = gcv.cv_results_["split%d_test_score" % j][i] cv_scores["alpha"].append(name) cv_scores["test_score"].append(vs) cv_scores["split"].append(j) df = pandas.DataFrame.from_dict(cv_scores) _, ax = plt.subplots(figsize=(11, 6)) sns.boxplot(x="alpha", y="test_score", data=df, order=order, ax=ax) _, xtext = plt.xticks() for t in xtext: t.set_rotation("vertical")
plot_performance(gcv)
We can observe that the model seems to be relative robust with respect to the choice for \(\alpha\) for this dataset. Let’s fit a model using the \(\alpha\) value that performed best.
estimator.set_params(**gcv.best_params_) estimator.fit(x, y)
FastSurvivalSVM(alpha=0.00390625, max_iter=1000, optimizer='avltree', random_state=0, tol=1e-05)
It is important to remember that only if the ranking objective is used exclusively (\(r = 1\)), that predictions denote risk scores, i.e. a higher predicted value indicates shorter survival, a lower value longer survival.
pred = estimator.predict(x.iloc[:2]) print(np.round(pred, 3)) print(y[:2])
[-1.59 -1.687] [( True, 72.) ( True, 411.)]
The model predicted that the first sample has a lower risk than the second sample, which is in concordance with the actual survival times.
If the regression objective is used (\(r < 1\)), the semantics are different, because now predictions are on the time scale and lower predicted values indicate shorter survival, higher values longer survival. Moreover, we saw from the histogram of observed times above that the distribution is skewed, therefore it is advised to log-transform the observed time before fitting a model. Here, we are going to use the transformation \(y^\prime = \log(1 + y)\).
y_log_t = y.copy() y_log_t["Survival_in_days"] = np.log1p(y["Survival_in_days"])
Let’s fit a model using the regression objective (\(r = 0\)) and compare its performance to the ranking model from above.
ref_estimator = FastSurvivalSVM(rank_ratio=0.0, max_iter=1000, tol=1e-5, random_state=0) ref_estimator.fit(x, y_log_t) cindex = concordance_index_censored( y['Status'], y['Survival_in_days'], -ref_estimator.predict(x), # flip sign to obtain risk scores ) print(round(cindex[0], 3))
0.724
Note that concordance_index_censored expects risk scores, therefore, we had to flip the sign of predictions. The resulting performance of the regression model is comparable to the of the ranking model above.
Finally, when predicting survival time, we need to invert the log-transformation via \(y = \exp(y^\prime) - 1\).
pred_log = ref_estimator.predict(x.iloc[:2]) pred_y = np.expm1(pred_log) print(np.round(pred_y, 3))
[106.683 131.105]
The Kernel Survival Support Vector Machine is a generalization of the Linear Survival Support Vector Machine that can account for more complex relationships between features and survival time, it is implemented in sksurv.svm.FastKernelSurvivalSVM. The disadvantage is that the choice of kernel function and its hyper-parameters is often not straightforward and requires tuning to obtain good results. For instance, the Radial Basis Function has a hyper-parameter \(\gamma\) that needs to be set: \(k(x, x^\prime) = \exp(-\gamma \|x-x^\prime \|^2)\). There are many other built-in kernel functions that can be used by passing their name as kernel parameter to FastKernelSurvivalSVM.
kernel
FastKernelSurvivalSVM
In this example, we are going to use the clinical kernel, because it is able to distinguish between continuous, ordinal, and nominal attributes. As this is a custom kernel function, we first need to pre-compute the kernel matrix.
from sksurv.kernels import clinical_kernel from sksurv.svm import FastKernelSurvivalSVM
[18]:
kernel_matrix = clinical_kernel(data_x)
As with the Linear Survival Support Vector Machine above, we are going to determine the optimal \(\alpha\) value by using GridSearchCV.
[19]:
kssvm = FastKernelSurvivalSVM(optimizer="rbtree", kernel="precomputed", random_state=0) kgcv = GridSearchCV(kssvm, param_grid, scoring=score_survival_model, n_jobs=4, refit=False, cv=cv)
Note that when using a custom kernel, we do not pass the original data (data_x) to the fit function, but the pre-computed, square kernel matrix.
[20]:
import warnings warnings.filterwarnings("ignore", category=FutureWarning) kgcv = kgcv.fit(kernel_matrix, y)
Now, print the best average concordance index and the corresponding parameters.
[21]:
round(kgcv.best_score_, 3), kgcv.best_params_
(0.709, {'alpha': 0.015625})
Finally, we visualize the distribution of test scores obtained via cross-validation.
[22]:
plot_performance(kgcv)
We can see that the choice of \(\alpha\) is much more important here, compared to the Linear Survival Support Vector Machine. Nevertheless, the best performance is below that of the linear model, which illustrates that choosing a good kernel function is essential, but also a non-trivial task.
Pölsterl, S., Navab, N., and Katouzian, A., Fast Training of Support Vector Machines for Survival Analysis, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, Lecture Notes in Computer Science, vol. 9285, pp. 243-259 (2015) Pölsterl, S., Navab, N., and Katouzian, A., An Efficient Training Algorithm for Kernel Survival Support Vector Machines 4th Workshop on Machine Learning in Life Sciences, 23 September 2016, Riva del Garda, Italy
Pölsterl, S., Navab, N., and Katouzian, A., Fast Training of Support Vector Machines for Survival Analysis, Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2015, Porto, Portugal, Lecture Notes in Computer Science, vol. 9285, pp. 243-259 (2015)
Pölsterl, S., Navab, N., and Katouzian, A., An Efficient Training Algorithm for Kernel Survival Support Vector Machines 4th Workshop on Machine Learning in Life Sciences, 23 September 2016, Riva del Garda, Italy
This page gives an overview of all public objects, functions and methods.
get_x_y(data_frame, attr_labels[, …])
get_x_y
Split data frame into features and labels.
load_aids([endpoint])
load_aids
Load and return the AIDS Clinical Trial dataset
load_arff_files_standardized(path_training, …)
load_arff_files_standardized
Load dataset in ARFF format.
load_breast_cancer()
load_breast_cancer
Load and return the breast cancer dataset
load_flchain()
load_flchain
Load and return assay of serum free light chain for 7874 subjects.
load_gbsg2()
load_gbsg2
Load and return the German Breast Cancer Study Group 2 dataset
load_whas500()
load_whas500
Load and return the Worcester Heart Attack Study dataset
load_veterans_lung_cancer()
load_veterans_lung_cancer
Load and return data from the Veterans’ Administration Lung Cancer Trial
ComponentwiseGradientBoostingSurvivalAnalysis([…])
ComponentwiseGradientBoostingSurvivalAnalysis
Gradient boosting with component-wise least squares as base learner.
GradientBoostingSurvivalAnalysis([loss, …])
GradientBoostingSurvivalAnalysis
Gradient-boosted Cox proportional hazard loss with regression trees as base learner.
RandomSurvivalForest([n_estimators, …])
RandomSurvivalForest
A random survival forest.
StepFunction(x, y[, a, b])
StepFunction
Callable step function.
compare_survival(y, group_indicator[, …])
compare_survival
K-sample log-rank hypothesis test of identical survival functions.
loadarff(filename)
loadarff
Load ARFF file
writearff(data, filename[, relation_name, index])
writearff
Write ARFF file
ClinicalKernelTransform([fit_once, …])
ClinicalKernelTransform
Transform data using a clinical Kernel
clinical_kernel(x[, y])
clinical_kernel
Computes clinical kernel
CoxnetSurvivalAnalysis([n_alphas, alphas, …])
CoxnetSurvivalAnalysis
Cox’s proportional hazard’s model with elastic net penalty.
CoxPHSurvivalAnalysis([alpha, ties, n_iter, …])
CoxPHSurvivalAnalysis
Cox proportional hazards model.
IPCRidge([alpha, fit_intercept, normalize, …])
IPCRidge
Accelerated failure time model with inverse probability of censoring weights.
EnsembleSelection(base_estimators[, scorer, …])
EnsembleSelection
Ensemble selection for survival analysis that accounts for a score and correlations between predictions.
EnsembleSelectionRegressor(base_estimators)
EnsembleSelectionRegressor
Ensemble selection for regression that accounts for the accuracy and correlation of errors.
Stacking(meta_estimator, base_estimators[, …])
Stacking
Meta estimator that combines multiple base learners.
brier_score(survival_train, survival_test, …)
brier_score
Estimate the time-dependent Brier score for right censored data.
concordance_index_censored(event_indicator, …)
Concordance index for right-censored data
concordance_index_ipcw(survival_train, …)
Concordance index for right-censored data based on inverse probability of censoring weights.
cumulative_dynamic_auc(survival_train, …)
cumulative_dynamic_auc
Estimator of cumulative/dynamic AUC for right-censored time-to-event data.
integrated_brier_score(survival_train, …)
integrated_brier_score
The Integrated Brier Score (IBS) provides an overall calculation of the model performance at all available times \(t_1 \leq t \leq t_\text{max}\).
CensoringDistributionEstimator()
CensoringDistributionEstimator
Kaplan–Meier estimator for the censoring distribution.
SurvivalFunctionEstimator()
SurvivalFunctionEstimator
Kaplan–Meier estimate of the survival function.
ipc_weights(event, time)
ipc_weights
Compute inverse probability of censoring weights
kaplan_meier_estimator(event, time_exit[, …])
Kaplan-Meier estimator of survival function.
nelson_aalen_estimator(event, time)
nelson_aalen_estimator
Nelson-Aalen estimator of cumulative hazard function.
OneHotEncoder([allow_drop])
OneHotEncoder
Encode categorical columns with M categories into M-1 columns according to the one-hot scheme.
categorical_to_numeric(table)
categorical_to_numeric
Encode categorical columns to numeric by converting each category to an integer value.
encode_categorical(table[, columns])
encode_categorical
standardize(table[, with_std])
standardize
Perform Z-Normalization on each numeric column of the given table.
HingeLossSurvivalSVM([solver, alpha, …])
HingeLossSurvivalSVM
Naive implementation of kernel survival support vector machine.
FastKernelSurvivalSVM([alpha, rank_ratio, …])
Efficient Training of kernel Survival Support Vector Machine.
FastSurvivalSVM([alpha, rank_ratio, …])
FastSurvivalSVM
Efficient Training of linear Survival Support Vector Machine
MinlipSurvivalAnalysis([solver, alpha, …])
MinlipSurvivalAnalysis
Survival model related to survival SVM, using a minimal Lipschitz smoothness strategy instead of a maximal margin strategy.
NaiveSurvivalSVM([penalty, loss, dual, tol, …])
NaiveSurvivalSVM
Naive version of linear Survival Support Vector Machine.
SurvivalTree([splitter, max_depth, …])
SurvivalTree
A survival tree.
Surv()
Surv
Helper class to construct structured array of event indicator and observed time.
This page explains how you can contribute to the development of scikit-survival. There are a lot of ways you can contribute:
Writing new code, e.g. implementations of new algorithms, or examples.
Fixing bugs.
Improving documentation.
Reviewing open pull requests.
scikit-survival is developed on GitHub using the Git version control system. The preferred way to contribute to scikit-survival is to fork the main repository on GitHub, then submit a pull request (PR).
These are the steps you need to take to create a copy of the scikit-survival repository on your computer.
Create an account on GitHub if you do not already have one.
Fork the scikit-survival repository.
Clone your fork of the scikit-survival repository from your GitHub account to your local disk. You have to execute from the command line:
git clone --recurse-submodules git@github.com:YourLogin/scikit-survival.git cd scikit-survival
After you created a copy of our main repository on GitHub, your need to setup a local development environment. We strongly recommend to use conda to create a separate virtual environment containing all dependencies. These are the steps you need to take.
Install conda for your operating system if you haven’t already.
Create a new environment, named sksurv:
sksurv
python ci/list-requirements.py requirements/dev.txt > dev-requirements.txt conda create -n sksurv -c sebp python=3 --file dev-requirements.txt
Activate the newly created environment:
conda activate sksurv
Compile the C/C++ extensions and install scikit-survival in development mode:
pip install --no-build-isolation -e .
For a pull request to be accepted, your changes must meet the below requirements.
All changes related to one feature must belong to one branch. Each branch must be self-contained, with a single new feature or bugfix. Create a new feature branch by executing:
git checkout -b my-new-feature
All code must follow the standard Python guidelines for code style, PEP8. To check that your code conforms to PEP8, you can install tox and run:
tox -e py38-flake8
Each function, class, method, and attribute needs to be documented using docstrings. scikit-survival conforms to the numpy docstring standard.
Code submissions must always include unit tests. We are using pytest. All tests must be part of the tests directory. You can run the test suite locally by executing:
tests
py.test tests/
Tests will also be executed automatically once you submit a pull request.
The contributed code will be licensed under the GNU General Public License v3.0. If you did not write the code yourself, you must ensure the existing license is compatible and include the license information in the contributed files, or obtain a permission from the original author to relicense the contributed code.
When you are done coding in your feature branch, add changed or new files:
git add path/to/modified_file
Create a commit message describing what you changed. Commit messages should be clear and concise. The first line should contain the subject of the commit and not exceed 80 characters in length. If necessary, add a blank line after the subject followed by a commit message body with more details:
git commit
Push the changes to GitHub:
git push -u origin my_feature
Create a pull request.
The documentation resides in the doc/ folder and is written in reStructuredText. HTML files of the documentation can be generated using Sphinx. The easiest way to build the documentation is to install tox and run:
doc/
tox -e py38-docs
Generated files will be located in doc/_build/html. To open the main page of the documentation, run:
doc/_build/html
xdg-open _build/html/index.html
This release adds support for scikit-learn 0.24 and Python 3.9. scikit-survival now requires at least pandas 0.25 and scikit-learn 0.24. Moreover, if sksurv.ensemble.GradientBoostingSurvivalAnalysis. or sksurv.ensemble.ComponentwiseGradientBoostingSurvivalAnalysis are fit with loss='coxph', predict_cumulative_hazard_function and predict_survival_function are now available. sksurv.metrics.cumulative_dynamic_auc() now supports evaluating time-dependent predictions, for instance for a sksurv.ensemble.RandomSurvivalForest as illustrated in the User Guide.
sksurv.ensemble.GradientBoostingSurvivalAnalysis
sksurv.ensemble.ComponentwiseGradientBoostingSurvivalAnalysis
loss='coxph'
sksurv.metrics.cumulative_dynamic_auc()
sksurv.ensemble.RandomSurvivalForest
Allow passing pandas data frames to all fit and predict methods (#148).
predict
Allow sparse matrices to be passed to sksurv.ensemble.GradientBoostingSurvivalAnalysis.predict().
sksurv.ensemble.GradientBoostingSurvivalAnalysis.predict()
Fix example in user guide using GridSearchCV to determine alphas for CoxnetSurvivalAnalysis (#186).
Add score method to sksurv.meta.Stacking, sksurv.meta.EnsembleSelection, and sksurv.meta.EnsembleSelectionRegressor (#151).
sksurv.meta.Stacking
sksurv.meta.EnsembleSelection
sksurv.meta.EnsembleSelectionRegressor
Add support for predict_cumulative_hazard_function and predict_survival_function to sksurv.ensemble.GradientBoostingSurvivalAnalysis. and sksurv.ensemble.ComponentwiseGradientBoostingSurvivalAnalysis if model was fit with loss='coxph'.
Add support for time-dependent predictions to sksurv.metrics.cumulative_dynamic_auc() See the User Guide for an example (#134).
The score method of sksurv.linear_model.IPCRidge, sksurv.svm.FastSurvivalSVM, and sksurv.svm.FastKernelSurvivalSVM (if rank_ratio is smaller than 1) now converts predictions on log(time) scale to risk scores prior to computing the concordance index.
sksurv.linear_model.IPCRidge
sksurv.svm.FastSurvivalSVM
sksurv.svm.FastKernelSurvivalSVM
rank_ratio
Support for cvxpy and cvxopt solver in sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM has been dropped. The default solver is now ECOS, which was used by cvxpy (the previous default) internally. Therefore, results should be identical.
sksurv.svm.MinlipSurvivalAnalysis
sksurv.svm.HingeLossSurvivalSVM
Dropped the presort argument from sksurv.tree.SurvivalTree and sksurv.ensemble.GradientBoostingSurvivalAnalysis.
presort
sksurv.tree.SurvivalTree
The X_idx_sorted argument in sksurv.tree.SurvivalTree.fit() has been deprecated in scikit-learn 0.24 and has no effect now.
X_idx_sorted
sksurv.tree.SurvivalTree.fit()
predict_cumulative_hazard_function and predict_survival_function of sksurv.ensemble.RandomSurvivalForest and sksurv.tree.SurvivalTree now return an array of sksurv.functions.StepFunction objects by default. Use return_array=True to get the old behavior.
sksurv.functions.StepFunction
return_array=True
Support for Python 3.6 has been dropped.
Increase minimum supported versions of dependencies. We now require:
Package Minimum Version Pandas 0.25.0 scikit-learn 0.24.0
Package
Minimum Version
Pandas
0.25.0
scikit-learn
0.24.0
This release features a complete overhaul of the documentation. It features a new visual design, and the inclusion of several interactive notebooks in the User Guide.
In addition, it includes important bug fixes. It fixes several bugs in sksurv.linear_model.CoxnetSurvivalAnalysis where predict, predict_survival_function, and predict_cumulative_hazard_function returned wrong values if features of the training data were not centered. Moreover, the score function of sksurv.ensemble.ComponentwiseGradientBoostingSurvivalAnalysis and sksurv.ensemble.GradientBoostingSurvivalAnalysis will now correctly compute the concordance index if loss='ipcwls' or loss='squared'.
sksurv.linear_model.CoxnetSurvivalAnalysis
predict_cumulative_hazard_function
loss='ipcwls'
loss='squared'
sksurv.column.standardize() modified data in-place. Data is now always copied.
sksurv.column.standardize()
sksurv.column.standardize() works with integer numpy arrays now.
sksurv.column.standardize() used biased standard deviation for numpy arrays (ddof=0), but unbiased standard deviation for pandas objects (ddof=1). It always uses ddof=1 now. Therefore, the output, if the input is a numpy array, will differ from that of previous versions.
ddof=0
ddof=1
Fixed sksurv.linear_model.CoxnetSurvivalAnalysis.predict_survival_function() and sksurv.linear_model.CoxnetSurvivalAnalysis.predict_cumulative_hazard_function(), which returned wrong values if features of training data were not already centered. This adds an offset_ attribute that accounts for non-centered data and is added to the predicted risk score. Therefore, the outputs of predict, predict_survival_function, and predict_cumulative_hazard_function will be different to previous versions for non-centered data (#139).
sksurv.linear_model.CoxnetSurvivalAnalysis.predict_survival_function()
sksurv.linear_model.CoxnetSurvivalAnalysis.predict_cumulative_hazard_function()
offset_
Rescale coefficients of sksurv.linear_model.CoxnetSurvivalAnalysis if normalize=True.
Fix score function of sksurv.ensemble.ComponentwiseGradientBoostingSurvivalAnalysis and sksurv.ensemble.GradientBoostingSurvivalAnalysis if loss='ipcwls' or loss='squared' is used. Previously, it returned 1.0 - true_cindex.
1.0 - true_cindex
Add sksurv.show_versions() that prints the version of all dependencies.
sksurv.show_versions()
Add support for pandas 1.1
Include interactive notebooks in documentation on readthedocs.
Add user guide on penalized Cox models.
Add user guide on gradient boosted models.
This release fixes warnings that were introduced with 0.13.0.
Explicitly pass return_array=True in sksurv.tree.SurvivalTree.predict() to avoid FutureWarning.
sksurv.tree.SurvivalTree.predict()
Fix error when fitting sksurv.tree.SurvivalTree with non-float dtype for time (#127).
Fix RuntimeWarning: invalid value encountered in true_divide in sksurv.nonparametric.kaplan_meier_estimator().
Fix PendingDeprecationWarning about use of matrix when fitting sksurv.svm.FastSurvivalSVM if optimizer is PRSVM or simple.
The highlights of this release include the addition of sksurv.metrics.brier_score() and sksurv.metrics.integrated_brier_score() and compatibility with scikit-learn 0.23.
sksurv.metrics.brier_score()
sksurv.metrics.integrated_brier_score()
predict_survival_function and predict_cumulative_hazard_function of sksurv.ensemble.RandomSurvivalForest and sksurv.tree.SurvivalTree can now return an array of sksurv.functions.StepFunction, similar to sksurv.linear_model.CoxPHSurvivalAnalysis by specifying return_array=False. This will be the default behavior starting with 0.14.0.
return_array=False
Note that this release fixes a bug in estimating inverse probability of censoring weights (IPCW), which will affect all estimators relying on IPCW.
Make build system compatible with PEP-517/518.
Added sksurv.metrics.brier_score() and sksurv.metrics.integrated_brier_score() (#101).
sksurv.functions.StepFunction can now be evaluated at multiple points in a single call.
Update documentation on usage of predict_survival_function and predict_cumulative_hazard_function (#118).
The default value of alpha_min_ratio of sksurv.linear_model.CoxnetSurvivalAnalysis will now depend on the n_samples/n_features ratio. If n_samples > n_features, the default value is 0.0001 If n_samples <= n_features, the default value is 0.01.
n_samples > n_features
n_samples <= n_features
Add support for scikit-learn 0.23 (#119).
predict_survival_function and predict_cumulative_hazard_function of sksurv.ensemble.RandomSurvivalForest and sksurv.tree.SurvivalTree will return an array of sksurv.functions.StepFunction in the future (as sksurv.linear_model.CoxPHSurvivalAnalysis does). For the old behavior, use return_array=True.
Fix deprecation of importing joblib via sklearn.
Fix estimation of censoring distribution for tied times with events. When estimating the censoring distribution, by specifying reverse=True when calling sksurv.nonparametric.kaplan_meier_estimator(), we now consider events to occur before censoring. For tied time points with an event, those with an event are not considered at risk anymore and subtracted from the denominator of the Kaplan-Meier estimator. The change affects all functions relying on inverse probability of censoring weights, namely:
reverse=True
sksurv.nonparametric.CensoringDistributionEstimator
sksurv.nonparametric.ipc_weights()
sksurv.metrics.concordance_index_ipcw()
Throw an exception when trying to estimate c-index from uncomparable data (#117).
Estimators in sksurv.svm will now throw an exception when trying to fit a model to data with uncomparable pairs.
sksurv.svm
This release adds support for scikit-learn 0.22, thereby dropping support for older versions. Moreover, the regularization strength of the ridge penalty in sksurv.linear_model.CoxPHSurvivalAnalysis can now be set per feature. If you want one or more features to enter the model unpenalized, set the corresponding penalty weights to zero. Finally, sklearn.pipeline.Pipeline will now be automatically patched to add support for predict_cumulative_hazard_function and predict_survival_function if the underlying estimator supports it.
sklearn.pipeline.Pipeline
Add scikit-learn’s deprecation of presort in sksurv.tree.SurvivalTree and sksurv.ensemble.GradientBoostingSurvivalAnalysis.
Add warning that default alpha_min_ratio in sksurv.linear_model.CoxnetSurvivalAnalysis will depend on the ratio of the number of samples to the number of features in the future (#41).
Add references to API doc of sksurv.ensemble.GradientBoostingSurvivalAnalysis (#91).
Add support for pandas 1.0 (#100).
Add ccp_alpha parameter for Minimal Cost-Complexity Pruning to sksurv.ensemble.GradientBoostingSurvivalAnalysis.
Patch sklearn.pipeline.Pipeline to add support for predict_cumulative_hazard_function and predict_survival_function if the underlying estimator supports it.
Allow per-feature regularization for sksurv.linear_model.CoxPHSurvivalAnalysis (#102).
Clarify API docs of sksurv.metrics.concordance_index_censored() (#96).
This release adds sksurv.tree.SurvivalTree and sksurv.ensemble.RandomSurvivalForest, which are based on the log-rank split criterion. It also adds the OSQP solver as option to sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM, which will replace the now deprecated cvxpy and cvxopt options in a future release.
This release removes support for sklearn 0.20 and requires sklearn 0.21.
The cvxpy and cvxopt options for solver in sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM are deprecated and will be removed in a future version. Choosing osqp is the preferred option now.
Add support for pandas 0.25.
Add OSQP solver option to sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM which has no additional dependencies.
Fix issue when using cvxpy 1.0.16 or later.
Explicitly specify utf-8 encoding when reading README.rst (#89).
Add sksurv.tree.SurvivalTree and sksurv.ensemble.RandomSurvivalForest (#90).
Exclude Cython-generated files from source distribution because they are not forward compatible.
This release adds the ties argument to sksurv.linear_model.CoxPHSurvivalAnalysis to choose between Breslow’s and Efron’s likelihood in the presence of tied event times. Moreover, sksurv.compare.compare_survival() has been added, which implements the log-rank hypothesis test for comparing the survival function of 2 or more groups.
sksurv.compare.compare_survival()
Update API doc of predict function of boosting estimators (#75).
Clarify documentation for GradientBoostingSurvivalAnalysis (#78).
Implement Efron’s likelihood for handling tied event times.
Implement log-rank test for comparing survival curves.
Add support for scipy 1.3.1 (#66).
Re-add baseline_survival_ and cum_baseline_hazard_ attributes to sksurv.linear_model.CoxPHSurvivalAnalysis (#76).
This release adds support for sklearn 0.21 and pandas 0.24.
Add reference to IPCRidge (#65).
Use scipy.special.comb instead of deprecated scipy.misc.comb.
Add support for pandas 0.24 and drop support for 0.20.
Add support for scikit-learn 0.21 and drop support for 0.20 (#71).
Explain use of intercept in ComponentwiseGradientBoostingSurvivalAnalysis (#68)
Bump Eigen to 3.3.7.
Disallow scipy 1.3.0 due to scipy regression (#66).
Add sksurv.linear_model.CoxnetSurvivalAnalysis.predict_survival_function() and sksurv.linear_model.CoxnetSurvivalAnalysis.predict_cumulative_hazard_function() (#46).
Add sksurv.nonparametric.SurvivalFunctionEstimator and sksurv.nonparametric.CensoringDistributionEstimator that wrap sksurv.nonparametric.kaplan_meier_estimator() and provide a predict_proba method for evaluating the estimated function on test data.
sksurv.nonparametric.SurvivalFunctionEstimator
Implement censoring-adjusted C-statistic proposed by Uno et al. (2011) in sksurv.metrics.concordance_index_ipcw().
Add estimator of cumulative/dynamic AUC of Uno et al. (2007) in sksurv.metrics.cumulative_dynamic_auc().
Add flchain dataset (see sksurv.datasets.load_flchain()).
sksurv.datasets.load_flchain()
The tied_time return value of sksurv.metrics.concordance_index_censored() now correctly reflects the number of comparable pairs that share the same time and that are used in computing the concordance index.
Fix a bug in sksurv.metrics.concordance_index_censored() where a pair with risk estimates within tolerance was counted both as concordant and tied.
This release adds support for Python 3.7 and sklearn 0.20.
Changes:
Add support for sklearn 0.20 (#48).
Migrate to py.test (#50).
Explicitly request ECOS solver for sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM.
Add support for Python 3.7 (#49).
Add support for cvxpy >=1.0.
Add support for numpy 1.15.
This release adds support for numpy 1.14 and pandas up to 0.23. In addition, the new class sksurv.util.Surv makes it easier to construct a structured array from numpy arrays, lists, or a pandas data frame.
sksurv.util.Surv
Support numpy 1.14 and pandas 0.22, 0.23 (#36).
Enable support for cvxopt with Python 3.5+ on Windows (requires cvxopt >=1.1.9).
Add max_iter parameter to sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM.
Fix score function of sksurv.svm.NaiveSurvivalSVM to use concordance index.
sksurv.svm.NaiveSurvivalSVM
sksurv.linear_model.CoxnetSurvivalAnalysis now throws an exception if coefficients get too large (#47).
Add sksurv.util.Surv class to ease constructing a structured array (#26).
This release adds support for scikit-learn 0.19 and pandas 0.21. In turn, support for older versions is dropped, namely Python 3.4, scikit-learn 0.18, and pandas 0.18.
This release adds sksurv.linear_model.CoxnetSurvivalAnalysis, which implements an efficient algorithm to fit Cox’s proportional hazards model with LASSO, ridge, and elastic net penalty. Moreover, it includes support for Windows with Python 3.5 and later by making the cvxopt package optional.
This release adds sksurv.linear_model.CoxPHSurvivalAnalysis.predict_survival_function() and sksurv.linear_model.CoxPHSurvivalAnalysis.predict_cumulative_hazard_function(), which return the survival function and cumulative hazard function using Breslow’s estimator. Moreover, it fixes a build error on Windows (gh #3) and adds the sksurv.preprocessing.OneHotEncoder class, which can be used in a scikit-learn pipeline.
sksurv.preprocessing.OneHotEncoder
This release adds support for Python 3.6, and pandas 0.19 and 0.20.
This is the initial release of scikit-survival. It combines the implementation of survival support vector machines with the code used in the Prostate Cancer DREAM challenge.
If you are using scikit-survival in your scientific research, please cite the following paper:
S. Pölsterl, “scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn,” Journal of Machine Learning Research, vol. 21, no. 212, pp. 1–6, 2020.
BibTeX entry:
@article{sksurv, author = {Sebastian P{\"o}lsterl}, title = {scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn}, journal = {Journal of Machine Learning Research}, year = {2020}, volume = {21}, number = {212}, pages = {1-6}, url = {http://jmlr.org/papers/v21/20-729.html} }