sksurv.datasets.load_arff_files_standardized

sksurv.datasets.load_arff_files_standardized(path_training, attr_labels, pos_label=None, path_testing=None, survival=True, standardize_numeric=True, to_numeric=True, *, output_type='pandas')

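Load a dataset in ARFF format, optionally together with hold-out data, and optionally standardize numeric features and convert categorical features to numeric values.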

Parameters:
  • path_training (str) – Path to ARFF file containing training data.

  • attr_labels (sequence of str) – Names of attributes denoting dependent variables. If survival is True, it must be a sequence with two items: the name of the event indicator and the name of the survival/censoring time.

  • pos_label (any type, optional) – Value corresponding to an event in survival analysis. Only considered if survival is True.

  • path_testing (str, optional) – Path to ARFF file containing hold-out data. Only columns that are available in both training and testing data are considered (excluding dependent variables). If standardize_numeric is True, the data is standardized using training and testing data together, not the training data alone.

  • survival (bool, optional, default: True) – Whether the dependent variables denote event indicator and survival/censoring time.

  • standardize_numeric (bool, optional, default: True) – Whether to standardize data to zero mean and unit variance. See sksurv.column.standardize().

  • to_numeric (bool, optional, default: True) – Whether to convert categorical variables to numeric values. See sksurv.column.categorical_to_numeric().

  • output_type ({"pandas", "polars"}, default="pandas") – Dataframe library used for the returned x_train / x_test. All derivations (concatenation, standardization, numeric conversion) run through narwhals in the requested dataframe library; there is no intermediate pandas conversion when output_type="polars".
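
The effect of standardizing with both training and testing data can be illustrated with a minimal NumPy sketch. It does not call sksurv.column.standardize(); the helper name standardize_jointly and the choice of the sample standard deviation (ddof=1) are assumptions made for illustration only.

```python
import numpy as np


def standardize_jointly(train, test, ddof=1):
    """Standardize columns with statistics pooled over train and test rows.

    Hypothetical helper; ddof=1 is an assumption, not taken from the library.
    """
    pooled = np.vstack([train, test])
    mean = pooled.mean(axis=0)
    std = pooled.std(axis=0, ddof=ddof)
    return (train - mean) / std, (test - mean) / std


# Toy numeric features: 3 training rows and 2 testing rows, 2 columns each.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0], [5.0, 50.0]])

train_std, test_std = standardize_jointly(train, test)

# Zero mean and unit variance hold for the pooled rows,
# not for the training rows or the testing rows on their own.
pooled_std = np.vstack([train_std, test_std])
```

Because the statistics are pooled, the standardized training data alone will generally not have zero mean, and information from the hold-out data influences the scaling of the training data.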

Returns:

  • x_train (pandas.DataFrame or polars.DataFrame, shape = (n_train, n_features)) – Training data.

  • y_train (structured array, Series, DataFrame, or None) – Dependent variables of training data.

    If survival is True, a structured array with two fields. The first field is a boolean where True indicates an event and False indicates right-censoring. The second field is a float with the time of event or time of censoring.

    If survival is False and attr_labels is a single column name, a Series in the output_type dataframe library; if it is a sequence of column names, a DataFrame with those columns.

    If survival is False and attr_labels is None, y_train is set to None.

  • x_test (None, or pandas.DataFrame / polars.DataFrame of shape (n_test, n_features)) – Testing data if path_testing was provided. Dataframe library matches output_type.

  • y_test (structured array, Series, DataFrame, or None) – Dependent variables of testing data if path_testing was provided.

    If survival is True, a structured array with two fields. The first field is a boolean where True indicates an event and False indicates right-censoring. The second field is a float with the time of event or time of censoring.

    If survival is False and attr_labels is a single column name, a Series in the output_type dataframe library; if it is a sequence of column names, a DataFrame with those columns.

    If survival is False and attr_labels is None, y_test is set to None.
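
As a rough illustration of the structured array returned when survival is True, the sketch below builds such an array by hand with NumPy and shows how pos_label selects which value counts as an event. The field names "event" and "time", the status values, and the times are hypothetical; the field names of the array actually returned are not specified above.

```python
import numpy as np

# Hypothetical raw dependent variables as they might appear in an ARFF file.
raw_status = np.array(["dead", "alive", "dead"])
raw_time = np.array([12.0, 30.5, 7.25])
pos_label = "dead"  # value that corresponds to an event

# Two fields: a boolean event indicator first, then a float time.
y = np.empty(raw_status.shape[0], dtype=[("event", bool), ("time", float)])
y["event"] = raw_status == pos_label  # True = event, False = right-censored
y["time"] = raw_time

# Access the fields by position so the code does not depend on their names.
event_name, time_name = y.dtype.names
n_events = int(y[event_name].sum())
censored_times = y[time_name][~y[event_name]]
```

Looking up the field names through y.dtype.names keeps downstream code independent of the particular attribute names passed in attr_labels.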