scikit-survival¶
scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizing the power of scikit-learn, e.g., for pre-processing or doing cross-validation.
The objective in survival analysis (also referred to as time-to-event or reliability analysis) is to establish a connection between covariates and the time of an event. What makes survival analysis differ from traditional machine learning is the fact that parts of the training data can only be partially observed – they are censored.
For instance, in a clinical study, patients are often monitored for a particular time period, and events occurring in this particular period are recorded. If a patient experiences an event, the exact time of the event can be recorded – the patient’s record is uncensored. In contrast, right censored records refer to patients who remained event-free during the study period, and it is unknown whether an event has or has not occurred after the study ended. Consequently, survival analysis calls for models that take this unique characteristic of such a dataset into account.
Installation¶
The easiest way to install scikit-survival is to use Anaconda by running:
conda install -c sebp scikit-survival
Alternatively, you can install scikit-survival from source following this guide.
Documentation¶
Installing scikit-survival¶
The recommended and easiest way to install scikit-survival is to use Anaconda. Alternatively, you can install scikit-survival From Source.
Anaconda¶
Pre-built binary packages for Linux, MacOS, and Windows are available for Anaconda. If you have Anaconda installed, run:
conda install -c sebp scikit-survival
From Source¶
If you want to build scikit-survival from source, you will need a C/C++ compiler to compile extensions.
Linux
On Linux, you need to install gcc, which in most cases is available via your distribution’s packaging system. Please follow your distribution’s instructions on how to install packages.
MacOS
On MacOS, you need to install clang, which is available from the Command Line Tools package. Open a terminal and execute:
xcode-select --install
Alternatively, you can download it from the Apple Developers page. Log in with your Apple ID, then search and download the Command Line Tools for Xcode package.
Windows
On Windows, the compiler you need depends on the Python version you are using. See this guide to determine which Microsoft Visual C++ compiler to use with a specific Python version.
Latest Release¶
To install the latest release of scikit-survival from source, run:
pip install scikit-survival
Development Version¶
To install the latest source from our GitHub repository, you need to have Git installed and simply run:
pip install git+https://github.com/sebp/scikit-survival.git
Dependencies¶
The current minimum dependencies to run scikit-survival are:
- Python 3.5 or later
- cvxpy
- cvxopt
- joblib
- numexpr
- numpy 1.12 or later
- osqp
- pandas 0.21 or later
- scikit-learn 0.22
- scipy 1.0 or later
- C/C++ compiler
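To see which of these dependencies are already present in an environment, a small check along the following lines may help. It is only a sketch: `installed_versions` is a hypothetical helper, not part of scikit-survival, and it relies on `importlib.metadata`, which requires Python 3.8 or later (newer than the minimum Python version listed above). It reports versions but does not compare them against the minimums.

```python
from importlib import metadata

# Distribution names taken from the dependency list above.
DEPENDENCIES = [
    "cvxpy", "cvxopt", "joblib", "numexpr", "numpy",
    "osqp", "pandas", "scikit-learn", "scipy",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(DEPENDENCIES).items():
        print(f"{name}: {version or 'not installed'}")
```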
Understanding Predictions in Survival Analysis¶
What is Survival Analysis?¶
The objective in survival analysis — also referred to as reliability analysis in engineering — is to establish a connection between covariates and the time of an event. The name survival analysis originates from clinical research, where predicting the time to death, i.e., survival, is often the main objective. Survival analysis is a type of regression problem (one wants to predict a continuous value), but with a twist. It differs from traditional regression by the fact that parts of the training data can only be partially observed – they are censored.
As an example, consider a clinical study, which investigates cardiovascular disease and has been carried out over a 1 year period as in the figure below.
Patient A was lost to follow-up after three months with no recorded cardiovascular event, patient B experienced an event four and a half months after enrollment, patient C withdrew from the study three and a half months after enrollment, and patient E did not experience any event before the study ended. Consequently, the exact time of a cardiovascular event could only be recorded for patients B and D; their records are uncensored. For the remaining patients it is unknown whether they did or did not experience an event after termination of the study. The only valid information that is available for patients A, C, and E is that they were event-free up to their last follow-up. Therefore, their records are censored.
Formally, each patient record consists of a set of covariates \(x \in \mathbb{R}^d\), and the time \(t>0\) when an event occurred or the time \(c>0\) of censoring. Since censoring and experiencing an event are mutually exclusive, it is common to define an event indicator \(\delta \in \{0;1\}\) and the observable survival time \(y>0\). The observable time \(y\) of a right censored sample is defined as
\[y = \min(t, c) = \begin{cases} t & \text{if } \delta = 1 , \\ c & \text{if } \delta = 0 . \end{cases}\]
Consequently, survival analysis calls for models that take this unique characteristic of such a dataset into account.
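The relationship between the event time, the censoring time, and what is actually observed can be sketched in a few lines of plain Python. The function name `observe` and the example values are hypothetical and chosen only for illustration; treating a tie between event and censoring time as an event is an assumption of this sketch.

```python
import math


def observe(event_time, censoring_time):
    """Return (delta, y) for a right-censored sample.

    delta is True if the event was observed and False if the record is
    censored; y is the observable time min(t, c). Treating t == c as an
    event is an assumption of this sketch.
    """
    delta = event_time <= censoring_time
    y = min(event_time, censoring_time)
    return delta, y


# Hypothetical records (times in months): (event time t, censoring time c).
# math.inf stands for an event that was never observed.
records = {
    "event observed": (4.5, 12.0),
    "lost to follow-up": (math.inf, 3.0),
}
for label, (t, c) in records.items():
    print(label, observe(t, c))
```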
Basic Quantities¶
Rather than focusing on predicting a single point in time of an event, the prediction step in survival analysis often focuses on predicting a function: either the survival or hazard function. The survival function \(S(t)\) returns the probability of survival beyond time \(t\), i.e., \(S(t) = P(T > t)\), whereas the hazard function \(h(t)\) denotes an approximate probability (it is not bounded from above) that an event occurs in the small time interval \([t; t + \Delta t[\), under the condition that an individual would remain event-free up to time \(t\):
\[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} \geq 0 .\]
Alternative names for the hazard function are conditional failure rate, conditional mortality rate, or instantaneous failure rate. In contrast to the survival function, which describes the absence of an event, the hazard function provides information about the occurrence of an event. Finally, the cumulative hazard function \(H(t)\) is the integral over the interval \([0; t]\) of the hazard function:
\[H(t) = \int_0^t h(u)\,du .\]
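As a worked illustration, the sketch below integrates a hazard function numerically to obtain the cumulative hazard, and then converts it to a survival probability. The constant hazard of 0.1 is a made-up value, `cumulative_hazard` is a hypothetical helper, and the last step uses the standard identity \(S(t) = \exp(-H(t))\) for continuous event times, which is not stated in the text above.

```python
import math


def cumulative_hazard(hazard, t, n_steps=1000):
    """Approximate H(t), the integral of hazard(u) over [0, t], by the trapezoidal rule."""
    step = t / n_steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for k in range(1, n_steps):
        total += hazard(k * step)
    return total * step


def constant_hazard(u):
    """Hypothetical hazard: 0.1 events per unit of time, independent of u."""
    return 0.1


H = cumulative_hazard(constant_hazard, 5.0)  # close to 0.1 * 5 = 0.5
S = math.exp(-H)                             # survival probability beyond t = 5
print(round(H, 6), round(S, 6))
```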
Predictions¶
The survival function \(S(t)\) and cumulative hazard function \(H(t)\) can be estimated
from a set of observed time points \(\{(y_1, \delta_1), \ldots, (y_n, \delta_n)\}\) using
sksurv.nonparametric.kaplan_meier_estimator() and sksurv.nonparametric.nelson_aalen_estimator(),
respectively.
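To make the two estimators concrete, here is a minimal pure-Python sketch of the textbook Kaplan-Meier and Nelson-Aalen formulas. It is not the implementation used by scikit-survival; the function names and the five-sample data set are hypothetical.

```python
def _group_by_time(event, time):
    """Yield (t, n_events, n_samples) for each distinct time, in ascending order."""
    counts = {}
    for e, t in zip(event, time):
        n_events, n_samples = counts.get(t, (0, 0))
        counts[t] = (n_events + int(bool(e)), n_samples + 1)
    for t in sorted(counts):
        n_events, n_samples = counts[t]
        yield t, n_events, n_samples


def kaplan_meier(event, time):
    """Return (times, S), where S[k] is the estimated survival probability at times[k]."""
    n_at_risk = len(time)
    times, surv, s = [], [], 1.0
    for t, n_events, n_samples in _group_by_time(event, time):
        s *= 1.0 - n_events / n_at_risk  # fraction of the risk set surviving time t
        times.append(t)
        surv.append(s)
        n_at_risk -= n_samples  # events and censored samples leave the risk set
    return times, surv


def nelson_aalen(event, time):
    """Return (times, H), where H[k] is the estimated cumulative hazard at times[k]."""
    n_at_risk = len(time)
    times, cum_hazard, h = [], [], 0.0
    for t, n_events, n_samples in _group_by_time(event, time):
        h += n_events / n_at_risk  # increment: events divided by number at risk
        times.append(t)
        cum_hazard.append(h)
        n_at_risk -= n_samples
    return times, cum_hazard


# Hypothetical data: True marks an observed event, False a censored record.
event = [True, True, False, True, False]
time = [1, 2, 2, 3, 4]
print(kaplan_meier(event, time))  # survival is roughly 0.8, 0.6, 0.3, 0.3
print(nelson_aalen(event, time))  # cumulative hazard is roughly 0.2, 0.45, 0.95, 0.95
```

Censored samples never lower the survival estimate themselves; they only shrink the risk set used at later time points.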
The above estimators are often too simple, because they do not take additional factors into account
that could affect survival, e.g. age or a pre-existing condition.
Cox’s proportional hazards model (sksurv.linear_model.CoxPHSurvivalAnalysis) provides
a way to estimate survival and cumulative hazard function in the presence of additional covariates.
This is possible, because it assumes that a baseline hazard function exists and that covariates
change the “risk” (hazard) only proportionally. In other words, it assumes that the ratio of
the “risk” of experiencing an event of two patients remains constant over time.
After fitting Cox’s proportional hazards model, \(S(t)\) and \(H(t)\) can be estimated
using sksurv.linear_model.CoxPHSurvivalAnalysis.predict_survival_function() and
sksurv.linear_model.CoxPHSurvivalAnalysis.predict_cumulative_hazard_function(), respectively.
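The proportional hazards assumption can be illustrated without the library. In the standard formulation of Cox's model, the cumulative hazard of a sample with covariates \(x\) is the baseline cumulative hazard multiplied by \(\exp(x^\top \beta)\). The sketch below uses a made-up baseline and a made-up coefficient, and the helper names are hypothetical; it is not how scikit-survival computes its predictions.

```python
import math


def cox_cumulative_hazard(baseline_cum_hazard, coef, x):
    """Scale the baseline cumulative hazard by the relative risk exp(x . coef)."""
    relative_risk = math.exp(sum(b * v for b, v in zip(coef, x)))
    return [h0 * relative_risk for h0 in baseline_cum_hazard]


def cox_survival(baseline_cum_hazard, coef, x):
    """Survival probabilities obtained from the cumulative hazard as exp(-H)."""
    return [math.exp(-h) for h in cox_cumulative_hazard(baseline_cum_hazard, coef, x)]


# Hypothetical baseline cumulative hazard at three time points and one coefficient.
baseline = [0.1, 0.2, 0.4]
coef = [0.5]

reference = cox_cumulative_hazard(baseline, coef, [0.0])
exposed = cox_cumulative_hazard(baseline, coef, [1.0])

# The ratio between the two samples is the same at every time point.
print([e / r for e, r in zip(exposed, reference)])
print(cox_survival(baseline, coef, [1.0]))
```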
Important
For other survival models that do not rely on the proportional hazards assumption,
it is often impossible to estimate survival or cumulative hazard function.
Their predictions are risk scores of arbitrary scale. If samples are ordered according to
their predicted risk score (in ascending order), one obtains the sequence of events,
as predicted by the model.
This is the return value of the predict() method of all survival models in scikit-survival.
Consequently, predictions are often evaluated by a measure of rank correlation between predicted risk scores
and observed time points in the test data. In particular, Harrell’s concordance index
(sksurv.metrics.concordance_index_censored()) computes the ratio of correctly ordered
(concordant) pairs to comparable pairs and is the default performance metric when calling
a survival model’s score() method.
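The idea behind the concordance index can be sketched in plain Python. This simplified version is not the library's implementation: it assumes that a higher risk score means an earlier expected event, counts pairs with tied risk scores as half concordant (a common convention), and ignores pairs with identical observed times. The function name is hypothetical.

```python
def concordance_index(event, time, risk):
    """Simplified Harrell's c-index for right-censored data.

    A pair (i, j) is comparable if sample i had an event and its time is
    strictly smaller than the time of sample j. It is concordant if the
    sample with the earlier event also has the higher risk score.
    """
    concordant = tied = comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a censored sample cannot be the earlier one in a comparable pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    tied += 1
    if comparable == 0:
        raise ValueError("data has no comparable pairs")
    return (concordant + 0.5 * tied) / comparable


# Hypothetical data: the third sample is censored.
event = [True, True, False, True]
time = [1, 2, 3, 4]
print(concordance_index(event, time, risk=[4, 3, 2, 1]))  # perfectly ordered
print(concordance_index(event, time, risk=[1, 2, 3, 4]))  # perfectly reversed
```

A value of 0.5 corresponds to random ordering, which is what a constant risk score produces under the tie convention used here.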
API reference¶
Datasets¶
get_x_y(data_frame, attr_labels[, …]) | Split data frame into features and labels.
load_aids([endpoint]) | Load and return the AIDS Clinical Trial dataset
load_arff_files_standardized(path_training, …) | Load dataset in ARFF format.
load_breast_cancer() | Load and return the breast cancer dataset
load_flchain() | Load and return assay of serum free light chain for 7874 subjects.
load_gbsg2() | Load and return the German Breast Cancer Study Group 2 dataset
load_whas500() | Load and return the Worcester Heart Attack Study dataset
load_veterans_lung_cancer() | Load and return data from the Veterans’ Administration Lung Cancer Trial
Ensemble Models¶
ComponentwiseGradientBoostingSurvivalAnalysis([…]) | Gradient boosting with component-wise least squares as base learner.
GradientBoostingSurvivalAnalysis([loss, …]) | Gradient-boosted Cox proportional hazard loss with regression trees as base learner.
RandomSurvivalForest([n_estimators, …]) | A random survival forest.
Functions¶
StepFunction(x, y[, a, b]) | Callable step function.
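For readers unfamiliar with the concept, the following sketch shows what a callable step function does: it stores breakpoints and values, and returns the value of the last breakpoint that is not larger than the query. `SimpleStepFunction` is a hypothetical stand-in and does not reproduce the interface of `StepFunction`; in particular, returning the last value for queries beyond the final breakpoint is an assumption of this sketch.

```python
from bisect import bisect_right


class SimpleStepFunction:
    """Right-continuous step function defined by sorted breakpoints x and values y."""

    def __init__(self, x, y):
        if len(x) == 0 or len(x) != len(y):
            raise ValueError("x and y must be non-empty and of equal length")
        if list(x) != sorted(x):
            raise ValueError("x must be sorted in ascending order")
        self.x = list(x)
        self.y = list(y)

    def __call__(self, value):
        if value < self.x[0]:
            raise ValueError("value lies before the first breakpoint")
        # Index of the last breakpoint that is <= value.
        idx = bisect_right(self.x, value) - 1
        return self.y[idx]


# Hypothetical survival curve with drops at times 2 and 5.
survival = SimpleStepFunction([0, 2, 5], [1.0, 0.7, 0.4])
print(survival(1.9), survival(2), survival(10))
```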
Hypothesis testing¶
compare_survival(y, group_indicator[, …]) | K-sample log-rank hypothesis test of identical survival functions.
I/O Utilities¶
loadarff(filename) | Load ARFF file
writearff(data, filename[, relation_name, index]) | Write ARFF file
Kernels¶
ClinicalKernelTransform([fit_once, …]) | Transform data using a clinical Kernel
clinical_kernel(x[, y]) | Computes clinical kernel
Linear Models¶
CoxnetSurvivalAnalysis([n_alphas, alphas, …]) | Cox’s proportional hazard’s model with elastic net penalty.
CoxPHSurvivalAnalysis([alpha, ties, n_iter, …]) | Cox proportional hazards model.
IPCRidge([alpha, fit_intercept, normalize, …]) | Accelerated failure time model with inverse probability of censoring weights.
Meta Models¶
EnsembleSelection(base_estimators[, scorer, …]) | Ensemble selection for survival analysis that accounts for a score and correlations between predictions.
EnsembleSelectionRegressor(base_estimators) | Ensemble selection for regression that accounts for the accuracy and correlation of errors.
Stacking(meta_estimator, base_estimators[, …]) | Meta estimator that combines multiple base learners.
Metrics¶
brier_score(survival_train, survival_test, …) | Estimate the time-dependent Brier score for right censored data.
concordance_index_censored(event_indicator, …) | Concordance index for right-censored data
concordance_index_ipcw(survival_train, …) | Concordance index for right-censored data based on inverse probability of censoring weights.
cumulative_dynamic_auc(survival_train, …) | Estimator of cumulative/dynamic AUC for right-censored time-to-event data.
integrated_brier_score(survival_train, …) | The Integrated Brier Score (IBS) provides an overall calculation of the model performance at all available times \(t_1 \leq t \leq t_\text{max}\).
Non-parametric Estimators¶
CensoringDistributionEstimator() | Kaplan–Meier estimator for the censoring distribution.
SurvivalFunctionEstimator() | Kaplan–Meier estimate of the survival function.
ipc_weights(event, time) | Compute inverse probability of censoring weights
kaplan_meier_estimator(event, time_exit[, …]) | Kaplan-Meier estimator of survival function.
nelson_aalen_estimator(event, time) | Nelson-Aalen estimator of cumulative hazard function.
Pre-Processing¶
OneHotEncoder([allow_drop]) | Encode categorical columns with M categories into M-1 columns according to the one-hot scheme.
categorical_to_numeric(table) | Encode categorical columns to numeric by converting each category to an integer value.
encode_categorical(table[, columns]) | Encode categorical columns with M categories into M-1 columns according to the one-hot scheme.
standardize(table[, with_std]) | Perform Z-Normalization on each numeric column of the given table.
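The phrase "M categories into M-1 columns" may be easier to follow with a small example. The sketch below encodes one categorical column in plain Python; `one_hot_drop_first` is a hypothetical helper, and dropping the first category in sorted order is an assumption of this sketch rather than a statement about how the functions above choose the reference category.

```python
def one_hot_drop_first(values):
    """Encode a categorical column with M categories as M-1 indicator columns.

    The first category in sorted order acts as the reference: it is
    represented by zeros in all remaining columns.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # drop the reference category
    return {c: [1 if v == c else 0 for v in values] for c in kept}


# Hypothetical column with three categories, hence two indicator columns.
column = ["a", "b", "c", "a"]
print(one_hot_drop_first(column))
```

Dropping one column avoids redundant information, because membership in the reference category is implied when all indicators are zero.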
Survival Support Vector Machine¶
HingeLossSurvivalSVM([solver, alpha, …]) | Naive implementation of kernel survival support vector machine.
FastKernelSurvivalSVM([alpha, rank_ratio, …]) | Efficient Training of kernel Survival Support Vector Machine.
FastSurvivalSVM([alpha, rank_ratio, …]) | Efficient Training of linear Survival Support Vector Machine
MinlipSurvivalAnalysis([solver, alpha, …]) | Survival model related to survival SVM, using a minimal Lipschitz smoothness strategy instead of a maximal margin strategy.
NaiveSurvivalSVM([penalty, loss, dual, tol, …]) | Naive version of linear Survival Support Vector Machine.
Survival Trees¶
SurvivalTree([splitter, max_depth, …]) | A survival tree.
Contributing Guidelines¶
This page explains how you can contribute to the development of scikit-survival. There are a lot of ways you can contribute:
- Writing new code, e.g. implementations of new algorithms, or examples.
- Fixing bugs.
- Improving documentation.
- Reviewing open pull requests.
scikit-survival is developed on GitHub using the Git version control system. The preferred way to contribute to scikit-survival is to fork the main repository on GitHub, then submit a pull request (PR).
Creating a fork¶
These are the steps you need to take to create a copy of the scikit-survival repository on your computer.
Create an account on GitHub if you do not already have one.
Clone your fork of the scikit-survival repository from your GitHub account to your local disk. You have to execute from the command line:
git clone --recurse-submodules git@github.com:YourLogin/scikit-survival.git
cd scikit-survival
Setting up a Development Environment¶
After you have created a copy of our main repository on GitHub, you need to set up a local development environment. We strongly recommend using conda to create a separate virtual environment containing all dependencies. These are the steps you need to take.
Install conda for your operating system if you haven’t already.
Create a new environment, named sksurv:
python ci/list-requirements.py requirements/dev.txt > dev-requirements.txt
conda create -n sksurv -c sebp python=3 --file dev-requirements.txt
Activate the newly created environment:
conda activate sksurv
Compile the C/C++ extensions and install scikit-survival in development mode:
pip install --no-build-isolation -e .
Making Changes to the Code¶
For a pull request to be accepted, your changes must meet the requirements below.
All changes related to one feature must belong to one branch. Each branch must be self-contained, with a single new feature or bugfix. Create a new feature branch by executing:
git checkout -b my-new-feature
All code must follow the standard Python guidelines for code style, PEP8. To check that your code conforms to PEP8, you can install tox and run:
tox -e flake8
Each function, class, method, and attribute needs to be documented using docstrings. scikit-survival conforms to the numpy docstring standard.
Code submissions must always include unit tests. We are using pytest. All tests must be part of the tests directory. You can run the test suite locally by executing:
py.test tests/
Tests will also be executed automatically once you submit a pull request.
The contributed code will be licensed under the GNU General Public License v3.0. If you did not write the code yourself, you must ensure the existing license is compatible and include the license information in the contributed files, or obtain permission from the original author to relicense the contributed code.
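As an illustration of the documentation and testing requirements above, here is what a small contribution might look like. The function `clip_time` and its test are hypothetical and exist only to show a numpy-style docstring next to a test that pytest would collect; in a real contribution the test would live in a file inside the tests directory.

```python
def clip_time(time, max_time):
    """Truncate an observed time at a maximum follow-up time.

    Parameters
    ----------
    time : float
        Observed time of an event or of censoring.
    max_time : float
        Maximum follow-up time. Must be positive.

    Returns
    -------
    clipped : float
        The smaller of `time` and `max_time`.

    Raises
    ------
    ValueError
        If `max_time` is not positive.
    """
    if max_time <= 0:
        raise ValueError("max_time must be positive")
    return min(time, max_time)


def test_clip_time():
    # pytest collects functions whose name starts with "test_".
    assert clip_time(5.0, 3.0) == 3.0
    assert clip_time(2.0, 3.0) == 2.0
```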
Submitting a Pull Request¶
When you are done coding in your feature branch, add changed or new files:
git add path/to/modified_file
Create a commit message describing what you changed. Commit messages should be clear and concise. The first line should contain the subject of the commit and not exceed 80 characters in length. If necessary, add a blank line after the subject followed by a commit message body with more details:
git commit
Push the changes to GitHub:
git push -u origin my-new-feature
Building the Documentation¶
The documentation resides in the doc/ folder and is written in
reStructuredText. HTML files of the documentation can be generated using Sphinx.
The easiest way to build the documentation is to install tox and run:
tox -e docs
Generated files will be located in doc/_build/html. To open the main page
of the documentation, run from within the doc/ folder:
xdg-open _build/html/index.html
Release Notes¶
scikit-survival 0.13 (2020-06-28)¶
The highlights of this release include the addition of
sksurv.metrics.brier_score() and
sksurv.metrics.integrated_brier_score()
and compatibility with scikit-learn 0.23.
predict_survival_function and predict_cumulative_hazard_function
of sksurv.ensemble.RandomSurvivalForest and
sksurv.tree.SurvivalTree can now return an array of
sksurv.functions.StepFunctions, similar
to sksurv.linear_model.CoxPHSurvivalAnalysis
by specifying return_array=False. This will be the default
behavior starting with 0.14.0.
Note that this release fixes a bug in estimating inverse probability of censoring weights (IPCW), which will affect all estimators relying on IPCW.
Enhancements¶
- Make build system compatible with PEP-517/518.
- Added sksurv.metrics.brier_score() and sksurv.metrics.integrated_brier_score() (#101).
- sksurv.functions.StepFunction can now be evaluated at multiple points in a single call.
- Update documentation on usage of predict_survival_function and predict_cumulative_hazard_function (#118).
- The default value of alpha_min_ratio of sksurv.linear_model.CoxnetSurvivalAnalysis will now depend on the n_samples/n_features ratio. If n_samples > n_features, the default value is 0.0001. If n_samples <= n_features, the default value is 0.01.
- Add support for scikit-learn 0.23 (#119).
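The rule for the new default of alpha_min_ratio can be written as a one-line function. `default_alpha_min_ratio` is a hypothetical helper that merely restates the rule from the release note; it is not part of the library.

```python
def default_alpha_min_ratio(n_samples, n_features):
    """Return the default alpha_min_ratio described in the release note."""
    # More samples than features: smaller ratio; otherwise the larger one.
    return 0.0001 if n_samples > n_features else 0.01


print(default_alpha_min_ratio(500, 20))   # more samples than features
print(default_alpha_min_ratio(50, 2000))  # more features than samples
```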
Deprecations¶
- predict_survival_function and predict_cumulative_hazard_function of sksurv.ensemble.RandomSurvivalForest and sksurv.tree.SurvivalTree will return an array of sksurv.functions.StepFunctions in the future (as sksurv.linear_model.CoxPHSurvivalAnalysis does). For the old behavior, use return_array=True.
Bug fixes¶
- Fix deprecation of importing joblib via sklearn.
- Fix estimation of censoring distribution for tied times with events. When estimating the censoring distribution, by specifying reverse=True when calling sksurv.nonparametric.kaplan_meier_estimator(), we now consider events to occur before censoring. For tied time points with an event, those with an event are not considered at risk anymore and subtracted from the denominator of the Kaplan-Meier estimator. The change affects all functions relying on inverse probability of censoring weights.
- Throw an exception when trying to estimate c-index from uncomparable data (#117).
- Estimators in sksurv.svm will now throw an exception when trying to fit a model to data with uncomparable pairs.
scikit-survival 0.12 (2020-04-15)¶
This release adds support for scikit-learn 0.22, thereby dropping support for
older versions. Moreover, the regularization strength of the ridge penalty
in sksurv.linear_model.CoxPHSurvivalAnalysis can now be set per
feature. If you want one or more features to enter the model unpenalized,
set the corresponding penalty weights to zero.
Finally, sklearn.pipeline.Pipeline will now be automatically patched
to add support for predict_cumulative_hazard_function and predict_survival_function
if the underlying estimator supports it.
Deprecations¶
- Add scikit-learn’s deprecation of presort in sksurv.tree.SurvivalTree and sksurv.ensemble.GradientBoostingSurvivalAnalysis.
- Add warning that default alpha_min_ratio in sksurv.linear_model.CoxnetSurvivalAnalysis will depend on the ratio of the number of samples to the number of features in the future (#41).
Enhancements¶
- Add references to API doc of sksurv.ensemble.GradientBoostingSurvivalAnalysis (#91).
- Add support for pandas 1.0 (#100).
- Add ccp_alpha parameter for Minimal Cost-Complexity Pruning to sksurv.ensemble.GradientBoostingSurvivalAnalysis.
- Patch sklearn.pipeline.Pipeline to add support for predict_cumulative_hazard_function and predict_survival_function if the underlying estimator supports it.
- Allow per-feature regularization for sksurv.linear_model.CoxPHSurvivalAnalysis (#102).
- Clarify API docs of sksurv.metrics.concordance_index_censored() (#96).
scikit-survival 0.11 (2019-12-21)¶
This release adds sksurv.tree.SurvivalTree and sksurv.ensemble.RandomSurvivalForest,
which are based on the log-rank split criterion.
It also adds the OSQP solver as option to sksurv.svm.MinlipSurvivalAnalysis
and sksurv.svm.HingeLossSurvivalSVM, which will replace the now deprecated
cvxpy and cvxopt options in a future release.
This release removes support for sklearn 0.20 and requires sklearn 0.21.
Deprecations¶
- The cvxpy and cvxopt options for solver in sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM are deprecated and will be removed in a future version. Choosing osqp is the preferred option now.
Enhancements¶
- Add support for pandas 0.25.
- Add OSQP solver option to sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM which has no additional dependencies.
- Fix issue when using cvxpy 1.0.16 or later.
- Explicitly specify utf-8 encoding when reading README.rst (#89).
- Add sksurv.tree.SurvivalTree and sksurv.ensemble.RandomSurvivalForest (#90).
Bug fixes¶
- Exclude Cython-generated files from source distribution because they are not forward compatible.
scikit-survival 0.10 (2019-09-02)¶
This release adds the ties argument to sksurv.linear_model.CoxPHSurvivalAnalysis
to choose between Breslow’s and Efron’s likelihood in the presence of tied event times.
Moreover, sksurv.compare.compare_survival() has been added, which implements
the log-rank hypothesis test for comparing the survival function of 2 or more groups.
Enhancements¶
- Update API doc of predict function of boosting estimators (#75).
- Clarify documentation for GradientBoostingSurvivalAnalysis (#78).
- Implement Efron’s likelihood for handling tied event times.
- Implement log-rank test for comparing survival curves.
- Add support for scipy 1.3.1 (#66).
Bug fixes¶
- Re-add baseline_survival_ and cum_baseline_hazard_ attributes to sksurv.linear_model.CoxPHSurvivalAnalysis (#76).
scikit-survival 0.9 (2019-07-26)¶
This release adds support for sklearn 0.21 and pandas 0.24.
Enhancements¶
- Add reference to IPCRidge (#65).
- Use scipy.special.comb instead of deprecated scipy.misc.comb.
- Add support for pandas 0.24 and drop support for 0.20.
- Add support for scikit-learn 0.21 and drop support for 0.20 (#71).
- Explain use of intercept in ComponentwiseGradientBoostingSurvivalAnalysis (#68)
- Bump Eigen to 3.3.7.
Bug fixes¶
- Disallow scipy 1.3.0 due to scipy regression (#66).
scikit-survival 0.8 (2019-05-01)¶
Enhancements¶
- Add sksurv.linear_model.CoxnetSurvivalAnalysis.predict_survival_function() and sksurv.linear_model.CoxnetSurvivalAnalysis.predict_cumulative_hazard_function() (#46).
- Add sksurv.nonparametric.SurvivalFunctionEstimator and sksurv.nonparametric.CensoringDistributionEstimator that wrap sksurv.nonparametric.kaplan_meier_estimator() and provide a predict_proba method for evaluating the estimated function on test data.
- Implement censoring-adjusted C-statistic proposed by Uno et al. (2011) in sksurv.metrics.concordance_index_ipcw().
- Add estimator of cumulative/dynamic AUC of Uno et al. (2007) in sksurv.metrics.cumulative_dynamic_auc().
- Add flchain dataset (see sksurv.datasets.load_flchain()).
Bug fixes¶
- The tied_time return value of sksurv.metrics.concordance_index_censored() now correctly reflects the number of comparable pairs that share the same time and that are used in computing the concordance index.
- Fix a bug in sksurv.metrics.concordance_index_censored() where a pair with risk estimates within tolerance was counted both as concordant and tied.
scikit-survival 0.7 (2019-02-27)¶
This release adds support for Python 3.7 and sklearn 0.20.
Changes:
- Add support for sklearn 0.20 (#48).
- Migrate to py.test (#50).
- Explicitly request ECOS solver for sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM.
- Add support for Python 3.7 (#49).
- Add support for cvxpy >=1.0.
- Add support for numpy 1.15.
scikit-survival 0.6 (2018-10-07)¶
This release adds support for numpy 1.14 and pandas up to 0.23.
In addition, the new class sksurv.util.Surv makes it easier
to construct a structured array from numpy arrays, lists, or a pandas data frame.
Changes:
- Support numpy 1.14 and pandas 0.22, 0.23 (#36).
- Enable support for cvxopt with Python 3.5+ on Windows (requires cvxopt >=1.1.9).
- Add max_iter parameter to sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM.
- Fix score function of sksurv.svm.NaiveSurvivalSVM to use concordance index.
- sksurv.linear_model.CoxnetSurvivalAnalysis now throws an exception if coefficients get too large (#47).
- Add sksurv.util.Surv class to ease constructing a structured array (#26).
scikit-survival 0.5 (2017-12-09)¶
This release adds support for scikit-learn 0.19 and pandas 0.21. In turn, support for older versions is dropped, namely Python 3.4, scikit-learn 0.18, and pandas 0.18.
scikit-survival 0.4 (2017-10-28)¶
This release adds sksurv.linear_model.CoxnetSurvivalAnalysis, which implements
an efficient algorithm to fit Cox’s proportional hazards model with LASSO, ridge, and
elastic net penalty.
Moreover, it includes support for Windows with Python 3.5 and later by making the cvxopt
package optional.
scikit-survival 0.3 (2017-08-01)¶
This release adds sksurv.linear_model.CoxPHSurvivalAnalysis.predict_survival_function()
and sksurv.linear_model.CoxPHSurvivalAnalysis.predict_cumulative_hazard_function(),
which return the survival function and cumulative hazard function using Breslow’s
estimator.
Moreover, it fixes a build error on Windows (gh #3)
and adds the sksurv.preprocessing.OneHotEncoder class, which can be used in
a scikit-learn pipeline.
scikit-survival 0.2 (2017-05-29)¶
This release adds support for Python 3.6, and pandas 0.19 and 0.20.
scikit-survival 0.1 (2016-12-29)¶
This is the initial release of scikit-survival. It combines the implementation of survival support vector machines with the code used in the Prostate Cancer DREAM challenge.