scikit-survival

scikit-survival is a Python module for survival analysis built on top of scikit-learn. It lets you perform survival analysis while using the power of scikit-learn, e.g., for pre-processing or cross-validation.

The objective in survival analysis (also referred to as time-to-event or reliability analysis) is to establish a connection between covariates and the time of an event. What distinguishes survival analysis from traditional machine learning is that parts of the training data can only be partially observed – they are censored.

For instance, in a clinical study, patients are often monitored for a particular time period, and events occurring in this period are recorded. If a patient experiences an event, the exact time of the event can be recorded – the patient’s record is uncensored. In contrast, right-censored records refer to patients who remained event-free during the study period, so it is unknown whether an event occurred after the study ended. Consequently, survival analysis demands models that take this unique characteristic of such a dataset into account.

Installation

The easiest way to install scikit-survival is to use Anaconda by running:

conda install -c sebp scikit-survival

Alternatively, you can install scikit-survival from source following this guide.

Documentation

Installing scikit-survival

The recommended and easiest way to install scikit-survival is to use Anaconda. Alternatively, you can install scikit-survival from source.

Anaconda

Pre-built binary packages for Linux, MacOS, and Windows are available for Anaconda. If you have Anaconda installed, run:

conda install -c sebp scikit-survival

From Source

If you want to build scikit-survival from source, you will need a C/C++ compiler to compile extensions.

Linux

On Linux, you need to install gcc, which in most cases is available via your distribution’s packaging system. Please follow your distribution’s instructions on how to install packages.

MacOS

On MacOS, you need to install clang, which is available from the Command Line Tools package. Open a terminal and execute:

xcode-select --install

Alternatively, you can download it from the Apple Developers page. Log in with your Apple ID, then search and download the Command Line Tools for Xcode package.

Windows

On Windows, the compiler you need depends on the Python version you are using. See this guide to determine which Microsoft Visual C++ compiler to use with a specific Python version.

Latest Release

To install the latest release of scikit-survival from source, run:

pip install scikit-survival

Development Version

To install the latest source from our GitHub repository, you need to have Git installed and simply run:

pip install git+https://github.com/sebp/scikit-survival.git

Dependencies

The current minimum dependencies to run scikit-survival are:

  • Python 3.5 or later
  • cvxpy
  • cvxopt
  • joblib
  • numexpr
  • numpy 1.12 or later
  • osqp
  • pandas 0.21 or later
  • scikit-learn 0.22
  • scipy 1.0 or later
  • C/C++ compiler

Understanding Predictions in Survival Analysis

What is Survival Analysis?

The objective in survival analysis (also referred to as reliability analysis in engineering) is to establish a connection between covariates and the time of an event. The name survival analysis originates from clinical research, where predicting the time to death, i.e., survival, is often the main objective. Survival analysis is a type of regression problem (one wants to predict a continuous value), but with a twist. It differs from traditional regression in that parts of the training data can only be partially observed – they are censored.

As an example, consider a clinical study that investigates cardiovascular disease and was carried out over a 1-year period, as in the figure below.

[Figure: censoring]

Patient A was lost to follow-up after three months with no recorded cardiovascular event, patient B experienced an event four and a half months after enrollment, patient C withdrew from the study three and a half months after enrollment, and patient E did not experience any event before the study ended. Consequently, the exact time of a cardiovascular event could only be recorded for patients B and D; their records are uncensored. For the remaining patients it is unknown whether they did or did not experience an event after termination of the study. The only valid information that is available for patients A, C, and E is that they were event-free up to their last follow-up. Therefore, their records are censored.

Formally, each patient record consists of a set of covariates \(x \in \mathbb{R}^d\), and the time \(t>0\) when an event occurred or the time \(c>0\) of censoring. Since censoring and experiencing an event are mutually exclusive, it is common to define an event indicator \(\delta \in \{0;1\}\) and the observable survival time \(y>0\). The observable time \(y\) of a right-censored sample is defined as

\[\begin{split}y = \min(t, c) = \begin{cases} t & \text{if } \delta = 1 , \\ c & \text{if } \delta = 0 . \end{cases}\end{split}\]

Consequently, survival analysis demands models that take this unique characteristic of such a dataset into account.
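
The definition of the observable time can be illustrated with a minimal sketch in plain Python. It assumes a simulation setting in which both the event time and the censoring time are known, which is never the case for real censored records. The function name observe, the example values, and the convention of counting a tie between the two times as an event are assumptions made for illustration only.

```python
def observe(event_time, censoring_time):
    """Return (y, delta) for one right-censored record.

    y is the observable time min(t, c); delta is 1 if the event was
    observed and 0 if the record is censored. A tie t == c is counted
    as an event here (an assumed convention).
    """
    delta = 1 if event_time <= censoring_time else 0
    y = min(event_time, censoring_time)
    return y, delta


# Hypothetical (t, c) pairs, in months.
records = [(4.5, 12.0), (15.0, 3.0), (20.0, 12.0)]
observed = [observe(t, c) for t, c in records]
```

Only the first record is uncensored. For the other two, the observable time is the censoring time and the event indicator is zero.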

Basic Quantities

Rather than focusing on predicting a single point in time of an event, the prediction step in survival analysis often focuses on predicting a function: either the survival or hazard function. The survival function \(S(t)\) returns the probability of survival beyond time \(t\), i.e., \(S(t) = P(T > t)\), whereas the hazard function \(h(t)\) denotes an approximate probability (it is not bounded from above) that an event occurs in the small time interval \([t; t + \Delta t[\), under the condition that an individual would remain event-free up to time \(t\):

\[h(t) = \lim_{\Delta t \rightarrow 0} \frac{P(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} \geq 0 .\]

Alternative names for the hazard function are conditional failure rate, conditional mortality rate, or instantaneous failure rate. In contrast to the survival function, which describes the absence of an event, the hazard function provides information about the occurrence of an event. Finally, the cumulative hazard function \(H(t)\) is the integral over the interval \([0; t]\) of the hazard function:

\[H(t) = \int_0^t h(u)\,du .\]
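
As a small numerical illustration, the sketch below approximates \(H(t)\) for a hypothetical constant hazard using the midpoint rule. It also uses the standard identity \(S(t) = \exp(-H(t))\) for a continuous event time, which is not stated in the text above. The function name cumulative_hazard and the chosen rate are assumptions for this example.

```python
import math


def cumulative_hazard(hazard, t, steps=10_000):
    """Approximate H(t) = integral of h(u) over [0, t] with the midpoint rule."""
    width = t / steps
    return sum(hazard((i + 0.5) * width) for i in range(steps)) * width


rate = 0.2  # hypothetical constant hazard per unit of time

# For a constant hazard, H(t) = rate * t.
H = cumulative_hazard(lambda u: rate, 5.0)

# Standard identity for continuous event times: S(t) = exp(-H(t)).
S = math.exp(-H)
```

With a constant hazard the survival function is exponential, so S here approximates exp(-1).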

Predictions

The survival function \(S(t)\) and cumulative hazard function \(H(t)\) can be estimated from a set of observed time points \(\{(y_1, \delta_1), \ldots, (y_n, \delta_n)\}\) using sksurv.nonparametric.kaplan_meier_estimator() and sksurv.nonparametric.nelson_aalen_estimator(), respectively.
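
To make clear what these two estimators compute, here is a minimal plain-Python sketch of the textbook Kaplan-Meier and Nelson-Aalen formulas. It is not the implementation used by scikit-survival. The function names kaplan_meier and nelson_aalen and the toy data are assumptions for illustration.

```python
def _risk_table(events, times):
    """Yield (t, number of events at t, number at risk at t) per distinct time."""
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        n_events = sum(1 for e, u in zip(events, times) if e and u == t)
        yield t, n_events, at_risk


def kaplan_meier(events, times):
    """Return distinct times and the product-limit estimate of S(t)."""
    survival, xs, ys = 1.0, [], []
    for t, n_events, at_risk in _risk_table(events, times):
        survival *= 1.0 - n_events / at_risk
        xs.append(t)
        ys.append(survival)
    return xs, ys


def nelson_aalen(events, times):
    """Return distinct times and the Nelson-Aalen estimate of H(t)."""
    cum_hazard, xs, ys = 0.0, [], []
    for t, n_events, at_risk in _risk_table(events, times):
        cum_hazard += n_events / at_risk
        xs.append(t)
        ys.append(cum_hazard)
    return xs, ys


# Toy data: event indicator and observed time for five subjects.
events = [True, False, True, True, False]
times = [2, 3, 3, 5, 6]

km_times, km_surv = kaplan_meier(events, times)
na_times, na_haz = nelson_aalen(events, times)
```

Censored subjects never lower the survival estimate directly. They only leave the risk set, which changes the denominator at later event times.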

The above estimators are often too simple, because they do not take additional factors into account that could affect survival, e.g. age or a pre-existing condition. Cox’s proportional hazards model (sksurv.linear_model.CoxPHSurvivalAnalysis) provides a way to estimate survival and cumulative hazard function in the presence of additional covariates. This is possible, because it assumes that a baseline hazard function exists and that covariates change the “risk” (hazard) only proportionally. In other words, it assumes that the ratio of the “risk” of experiencing an event of two patients remains constant over time. After fitting Cox’s proportional hazards model, \(S(t)\) and \(H(t)\) can be estimated using sksurv.linear_model.CoxPHSurvivalAnalysis.predict_survival_function() and sksurv.linear_model.CoxPHSurvivalAnalysis.predict_cumulative_hazard_function(), respectively.
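
The proportionality assumption can be checked numerically with a small sketch. It assumes the usual form of the Cox model, \(h(t \mid x) = h_0(t) \exp(x^\top \beta)\). The baseline hazard, the coefficients, and the two covariate vectors are hypothetical values chosen for this example, not output of a fitted model.

```python
import math


def hazard(baseline, coef, x, t):
    """Hazard under proportional hazards: h0(t) * exp(x . coef)."""
    linear_predictor = sum(b * v for b, v in zip(coef, x))
    return baseline(t) * math.exp(linear_predictor)


def baseline(t):
    """Hypothetical baseline hazard that grows linearly with time."""
    return 0.01 * t


coef = [0.5, -1.0]        # hypothetical coefficients
patient_a = [1.0, 0.0]    # hypothetical covariates
patient_b = [0.0, 1.0]

# The baseline hazard cancels, so the ratio is the same at every time point.
ratios = [
    hazard(baseline, coef, patient_a, t) / hazard(baseline, coef, patient_b, t)
    for t in (1.0, 5.0, 10.0)
]
```

Every entry of ratios equals exp(1.5) up to rounding, regardless of the time point, because the time-dependent baseline hazard cancels in the ratio.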

Important

For other survival models that do not rely on the proportional hazards assumption, it is often impossible to estimate survival or cumulative hazard function. Their predictions are risk scores of arbitrary scale. If samples are ordered according to their predicted risk score (in ascending order), one obtains the sequence of events, as predicted by the model. This is the return value of the predict() method of all survival models in scikit-survival.

Consequently, predictions are often evaluated by a measure of rank correlation between predicted risk scores and observed time points in the test data. In particular, Harrell’s concordance index (sksurv.metrics.concordance_index_censored()) computes the ratio of correctly ordered (concordant) pairs to comparable pairs and is the default performance metric when calling a survival model’s score() method.
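
The idea behind the concordance index can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation in scikit-survival. It assumes that a higher risk score should go with a shorter time to event, treats a pair as comparable only if the sample with the shorter time experienced an event, and ignores pairs with tied times. The function name concordance_index and the toy data are assumptions.

```python
def concordance_index(events, times, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # the earlier sample of a comparable pair must be an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four samples, the third one is censored.
events = [True, True, False, True]
times = [1, 2, 3, 4]
risk_scores = [4.0, 2.0, 3.0, 1.0]

c_index = concordance_index(events, times, risk_scores)
```

Of the five comparable pairs in this toy example, four are ordered correctly, so the index is 0.8. A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking.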

API reference

Datasets

get_x_y(data_frame, attr_labels[, …]) Split data frame into features and labels.
load_aids([endpoint]) Load and return the AIDS Clinical Trial dataset
load_arff_files_standardized(path_training, …) Load dataset in ARFF format.
load_breast_cancer() Load and return the breast cancer dataset
load_flchain() Load and return assay of serum free light chain for 7874 subjects.
load_gbsg2() Load and return the German Breast Cancer Study Group 2 dataset
load_whas500() Load and return the Worcester Heart Attack Study dataset
load_veterans_lung_cancer() Load and return data from the Veterans’ Administration Lung Cancer Trial

Ensemble Models

ComponentwiseGradientBoostingSurvivalAnalysis([…]) Gradient boosting with component-wise least squares as base learner.
GradientBoostingSurvivalAnalysis([loss, …]) Gradient-boosted Cox proportional hazard loss with regression trees as base learner.
RandomSurvivalForest([n_estimators, …]) A random survival forest.

Functions

StepFunction(x, y[, a, b]) Callable step function.

Hypothesis testing

compare_survival(y, group_indicator[, …]) K-sample log-rank hypothesis test of identical survival functions.

I/O Utilities

loadarff(filename) Load ARFF file
writearff(data, filename[, relation_name, index]) Write ARFF file

Kernels

ClinicalKernelTransform([fit_once, …]) Transform data using a clinical Kernel
clinical_kernel(x[, y]) Computes clinical kernel

Linear Models

CoxnetSurvivalAnalysis([n_alphas, alphas, …]) Cox’s proportional hazards model with elastic net penalty.
CoxPHSurvivalAnalysis([alpha, ties, n_iter, …]) Cox proportional hazards model.
IPCRidge([alpha, fit_intercept, normalize, …]) Accelerated failure time model with inverse probability of censoring weights.

Meta Models

EnsembleSelection(base_estimators[, scorer, …]) Ensemble selection for survival analysis that accounts for a score and correlations between predictions.
EnsembleSelectionRegressor(base_estimators) Ensemble selection for regression that accounts for the accuracy and correlation of errors.
Stacking(meta_estimator, base_estimators[, …]) Meta estimator that combines multiple base learners.

Metrics

brier_score(survival_train, survival_test, …) Estimate the time-dependent Brier score for right censored data.
concordance_index_censored(event_indicator, …) Concordance index for right-censored data
concordance_index_ipcw(survival_train, …) Concordance index for right-censored data based on inverse probability of censoring weights.
cumulative_dynamic_auc(survival_train, …) Estimator of cumulative/dynamic AUC for right-censored time-to-event data.
integrated_brier_score(survival_train, …) The Integrated Brier Score (IBS) provides an overall calculation of the model performance at all available times \(t_1 \leq t \leq t_\text{max}\).

Non-parametric Estimators

CensoringDistributionEstimator() Kaplan–Meier estimator for the censoring distribution.
SurvivalFunctionEstimator() Kaplan–Meier estimate of the survival function.
ipc_weights(event, time) Compute inverse probability of censoring weights
kaplan_meier_estimator(event, time_exit[, …]) Kaplan-Meier estimator of survival function.
nelson_aalen_estimator(event, time) Nelson-Aalen estimator of cumulative hazard function.

Pre-Processing

OneHotEncoder([allow_drop]) Encode categorical columns with M categories into M-1 columns according to the one-hot scheme.
categorical_to_numeric(table) Encode categorical columns to numeric by converting each category to an integer value.
encode_categorical(table[, columns]) Encode categorical columns with M categories into M-1 columns according to the one-hot scheme.
standardize(table[, with_std]) Perform Z-Normalization on each numeric column of the given table.

Survival Support Vector Machine

HingeLossSurvivalSVM([solver, alpha, …]) Naive implementation of kernel survival support vector machine.
FastKernelSurvivalSVM([alpha, rank_ratio, …]) Efficient Training of kernel Survival Support Vector Machine.
FastSurvivalSVM([alpha, rank_ratio, …]) Efficient Training of linear Survival Support Vector Machine
MinlipSurvivalAnalysis([solver, alpha, …]) Survival model related to survival SVM, using a minimal Lipschitz smoothness strategy instead of a maximal margin strategy.
NaiveSurvivalSVM([penalty, loss, dual, tol, …]) Naive version of linear Survival Support Vector Machine.

Survival Trees

SurvivalTree([splitter, max_depth, …]) A survival tree.

Utilities

Surv Helper class to construct structured array of event indicator and observed time.

Contributing Guidelines

This page explains how you can contribute to the development of scikit-survival. There are a lot of ways you can contribute:

  • Writing new code, e.g. implementations of new algorithms, or examples.
  • Fixing bugs.
  • Improving documentation.
  • Reviewing open pull requests.

scikit-survival is developed on GitHub using the Git version control system. The preferred way to contribute to scikit-survival is to fork the main repository on GitHub, then submit a pull request (PR).

Creating a fork

These are the steps you need to take to create a copy of the scikit-survival repository on your computer.

  1. Create an account on GitHub if you do not already have one.

  2. Fork the scikit-survival repository.

  3. Clone your fork of the scikit-survival repository from your GitHub account to your local disk. You have to execute from the command line:

    git clone --recurse-submodules git@github.com:YourLogin/scikit-survival.git
    cd scikit-survival
    

Setting up a Development Environment

After you have created a copy of our main repository on GitHub, you need to set up a local development environment. We strongly recommend using conda to create a separate virtual environment containing all dependencies. These are the steps you need to take.

  1. Install conda for your operating system if you haven’t already.

  2. Create a new environment, named sksurv:

    python ci/list-requirements.py requirements/dev.txt > dev-requirements.txt
    conda create -n sksurv -c sebp python=3 --file dev-requirements.txt
    
  3. Activate the newly created environment:

    conda activate sksurv
    
  4. Compile the C/C++ extensions and install scikit-survival in development mode:

    pip install --no-build-isolation -e .
    

Making Changes to the Code

For a pull request to be accepted, your changes must meet the below requirements.

  1. All changes related to one feature must belong to one branch. Each branch must be self-contained, with a single new feature or bugfix. Create a new feature branch by executing:

    git checkout -b my-new-feature
    
  2. All code must follow the standard Python guidelines for code style, PEP8. To check that your code conforms to PEP8, you can install tox and run:

    tox -e flake8
    
  3. Each function, class, method, and attribute needs to be documented using docstrings. scikit-survival conforms to the numpy docstring standard.

  4. Code submissions must always include unit tests. We are using pytest. All tests must be part of the tests directory. You can run the test suite locally by executing:

    py.test tests/
    

    Tests will also be executed automatically once you submit a pull request.

  5. The contributed code will be licensed under the GNU General Public License v3.0. If you did not write the code yourself, you must ensure the existing license is compatible and include the license information in the contributed files, or obtain permission from the original author to relicense the contributed code.

Submitting a Pull Request

  1. When you are done coding in your feature branch, add changed or new files:

    git add path/to/modified_file
    
  2. Create a commit message describing what you changed. Commit messages should be clear and concise. The first line should contain the subject of the commit and not exceed 80 characters in length. If necessary, add a blank line after the subject followed by a commit message body with more details:

    git commit
    
  3. Push the changes to GitHub:

    git push -u origin my-new-feature
    
  4. Create a pull request.

Building the Documentation

The documentation resides in the doc/ folder and is written in reStructuredText. HTML files of the documentation can be generated using Sphinx. The easiest way to build the documentation is to install tox and run:

tox -e docs

Generated files will be located in doc/_build/html. To open the main page of the documentation, run:

xdg-open doc/_build/html/index.html

Release Notes

scikit-survival 0.13 (2020-06-28)

The highlights of this release include the addition of sksurv.metrics.brier_score() and sksurv.metrics.integrated_brier_score() and compatibility with scikit-learn 0.23.

predict_survival_function and predict_cumulative_hazard_function of sksurv.ensemble.RandomSurvivalForest and sksurv.tree.SurvivalTree can now return an array of sksurv.functions.StepFunctions, similar to sksurv.linear_model.CoxPHSurvivalAnalysis by specifying return_array=False. This will be the default behavior starting with 0.14.0.

Note that this release fixes a bug in estimating inverse probability of censoring weights (IPCW), which will affect all estimators relying on IPCW.

Enhancements
Deprecations
Bug fixes

scikit-survival 0.12 (2020-04-15)

This release adds support for scikit-learn 0.22, thereby dropping support for older versions. Moreover, the regularization strength of the ridge penalty in sksurv.linear_model.CoxPHSurvivalAnalysis can now be set per feature. If you want one or more features to enter the model unpenalized, set the corresponding penalty weights to zero. Finally, sklearn.pipeline.Pipeline will now be automatically patched to add support for predict_cumulative_hazard_function and predict_survival_function if the underlying estimator supports it.

Deprecations
Enhancements

scikit-survival 0.11 (2019-12-21)

This release adds sksurv.tree.SurvivalTree and sksurv.ensemble.RandomSurvivalForest, which are based on the log-rank split criterion. It also adds the OSQP solver as option to sksurv.svm.MinlipSurvivalAnalysis and sksurv.svm.HingeLossSurvivalSVM, which will replace the now deprecated cvxpy and cvxopt options in a future release.

This release removes support for sklearn 0.20 and requires sklearn 0.21.

Deprecations
Enhancements
Bug fixes
  • Exclude Cython-generated files from source distribution because they are not forward compatible.

scikit-survival 0.10 (2019-09-02)

This release adds the ties argument to sksurv.linear_model.CoxPHSurvivalAnalysis to choose between Breslow’s and Efron’s likelihood in the presence of tied event times. Moreover, sksurv.compare.compare_survival() has been added, which implements the log-rank hypothesis test for comparing the survival function of 2 or more groups.

Enhancements
  • Update API doc of predict function of boosting estimators (#75).
  • Clarify documentation for GradientBoostingSurvivalAnalysis (#78).
  • Implement Efron’s likelihood for handling tied event times.
  • Implement log-rank test for comparing survival curves.
  • Add support for scipy 1.3.1 (#66).
Bug fixes

scikit-survival 0.9 (2019-07-26)

This release adds support for sklearn 0.21 and pandas 0.24.

Enhancements
  • Add reference to IPCRidge (#65).
  • Use scipy.special.comb instead of deprecated scipy.misc.comb.
  • Add support for pandas 0.24 and drop support for 0.20.
  • Add support for scikit-learn 0.21 and drop support for 0.20 (#71).
  • Explain use of intercept in ComponentwiseGradientBoostingSurvivalAnalysis (#68)
  • Bump Eigen to 3.3.7.
Bug fixes
  • Disallow scipy 1.3.0 due to scipy regression (#66).

scikit-survival 0.8 (2019-05-01)

Enhancements
Bug fixes

scikit-survival 0.7 (2019-02-27)

This release adds support for Python 3.7 and sklearn 0.20.

Changes:

scikit-survival 0.6 (2018-10-07)

This release adds support for numpy 1.14 and pandas up to 0.23. In addition, the new class sksurv.util.Surv makes it easier to construct a structured array from numpy arrays, lists, or a pandas data frame.

Changes:

scikit-survival 0.5 (2017-12-09)

This release adds support for scikit-learn 0.19 and pandas 0.21. In turn, support for older versions is dropped, namely Python 3.4, scikit-learn 0.18, and pandas 0.18.

scikit-survival 0.4 (2017-10-28)

This release adds sksurv.linear_model.CoxnetSurvivalAnalysis, which implements an efficient algorithm to fit Cox’s proportional hazards model with LASSO, ridge, and elastic net penalty. Moreover, it includes support for Windows with Python 3.5 and later by making the cvxopt package optional.

scikit-survival 0.3 (2017-08-01)

This release adds sksurv.linear_model.CoxPHSurvivalAnalysis.predict_survival_function() and sksurv.linear_model.CoxPHSurvivalAnalysis.predict_cumulative_hazard_function(), which return the survival function and cumulative hazard function using Breslow’s estimator. Moreover, it fixes a build error on Windows (gh #3) and adds the sksurv.preprocessing.OneHotEncoder class, which can be used in a scikit-learn pipeline.

scikit-survival 0.2 (2017-05-29)

This release adds support for Python 3.6, and pandas 0.19 and 0.20.

scikit-survival 0.1 (2016-12-29)

This is the initial release of scikit-survival. It combines the implementation of survival support vector machines with the code used in the Prostate Cancer DREAM challenge.
