learner Package

Learner Class

An easy-to-use class that wraps scikit-learn estimators.

author:

Nitin Madnani (nmadnani@ets.org)

author:

Michael Heilman (mheilman@ets.org)

author:

Dan Blanchard (dblanchard@ets.org)

author:

Aoife Cahill (acahill@ets.org)

organization:

ETS

class skll.learner.Learner(model_type, probability=False, pipeline=False, feature_scaling='none', model_kwargs=None, pos_label=None, min_feature_count=1, sampler=None, sampler_kwargs=None, custom_learner_path=None, logger=None)[source]

Bases: object

A simpler interface around scikit-learn classification and regression estimators.

Parameters:
  • model_type (str) – Name of estimator to create (e.g., 'LogisticRegression'). See the skll package documentation for valid options.

  • probability (bool, default=False) – Should learner return probabilities of all labels (instead of just label with highest probability)?

  • pipeline (bool, default=False) – Should learner contain a pipeline attribute that holds a scikit-learn Pipeline object composed of all steps, including the vectorizer, the feature selector, the sampler, the feature scaler, and the actual estimator? Note that this will increase the size of the learner object in memory and also when it is saved to disk.

  • feature_scaling (str, default="none") – How to scale the features, if at all. Options are:

      - ‘with_std’: scale features using the standard deviation
      - ‘with_mean’: center features using the mean
      - ‘both’: do both scaling as well as centering
      - ‘none’: do neither scaling nor centering

  • model_kwargs (Optional[Dict[str, Any]], default=None) – A dictionary of keyword arguments to pass to the initializer for the specified model.

  • pos_label (Optional[skll.types.LabelType], default=None) – An integer or string denoting the label of the class to be treated as the positive class in a binary classification setting. If None, the class represented by the label that appears second when sorted is chosen as the positive class. For example, if the two labels in data are “A” and “B” and pos_label is not specified, “B” will be chosen as the positive class.

  • min_feature_count (int, default=1) – The minimum number of examples a feature must have a nonzero value in to be included.

  • sampler (Optional[str], default=None) – The sampler to use for kernel approximation, if desired. Valid values are:

      - ‘AdditiveChi2Sampler’
      - ‘Nystroem’
      - ‘RBFSampler’
      - ‘SkewedChi2Sampler’

  • sampler_kwargs (Optional[Dict[str, Any]], default=None) – A dictionary of keyword arguments to pass to the initializer for the specified sampler.

  • custom_learner_path (Optional[str], default=None) – Path to module where a custom classifier is defined.

  • logger (Optional[logging.Logger], default=None) – A logging object. If None is passed, get logger from __name__.
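The feature_scaling options and the default choice of pos_label described above can be sketched in a few lines of plain Python. This is only an illustration of the documented behavior, not SKLL's implementation: the helper names scaling_flags and default_pos_label are hypothetical, and the sketch assumes that all labels are of one sortable type.

```python
from typing import Hashable, Iterable, Tuple

# Assumed mapping from the documented option names to
# (center with mean, scale with standard deviation) flags.
_SCALING_FLAGS = {
    "none": (False, False),
    "with_mean": (True, False),
    "with_std": (False, True),
    "both": (True, True),
}


def scaling_flags(feature_scaling: str) -> Tuple[bool, bool]:
    """Return (center_with_mean, scale_with_std) for a feature_scaling option."""
    try:
        return _SCALING_FLAGS[feature_scaling]
    except KeyError:
        raise ValueError(f"Unknown feature_scaling option: {feature_scaling!r}") from None


def default_pos_label(labels: Iterable[Hashable]) -> Hashable:
    """Pick the label that appears second when the two distinct labels are sorted."""
    distinct = sorted(set(labels))
    if len(distinct) != 2:
        raise ValueError("A positive class is only defined for binary classification")
    return distinct[1]
```

With labels "A" and "B" and no explicit pos_label, default_pos_label returns "B", matching the example given for the pos_label parameter.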

cross_validate(examples, stratified=True, cv_folds=10, cv_seed=123456789, grid_search=True, grid_search_folds=5, grid_jobs=None, grid_objective=None, output_metrics=[], prediction_prefix=None, param_grid=None, shuffle=False, save_cv_folds=True, save_cv_models=False, use_custom_folds_for_grid_search=True)[source]

Cross-validate the learner on the given training examples.

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to cross-validate learner performance on.

  • stratified (bool, default=True) – Should we stratify the folds to ensure an even distribution of labels for each fold?

  • cv_folds (Union[int, skll.types.FoldMapping], default=10) – The number of folds to use for cross-validation, or a mapping from example IDs to folds.

  • cv_seed (int, default=123456789) – The value for seeding the random number generator used to create the random folds. Note that this seed is only used if either grid_search or shuffle is set to True.

  • grid_search (bool, default=True) – Should we do grid search when training each fold? Note that this makes cross-validation take much longer.

  • grid_search_folds (Union[int, skll.types.FoldMapping], default=5) – The number of folds to use when doing the grid search, or a mapping from example IDs to folds.

  • grid_jobs (Optional[int], default=None) – The number of jobs to run in parallel when doing the grid search. If None or 0, the number of grid search folds will be used.

  • grid_objective (Optional[str], default=None) – The name of the objective function to use when doing the grid search. Must be specified if grid_search is True.

  • output_metrics (List[str], default = []) – List of additional metric names to compute in addition to the metric used for grid search.

  • prediction_prefix (Optional[str], default=None) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by "_predictions.tsv".

  • param_grid (Optional[Dict[str, Any]], default=None) – The parameter grid to search.

  • shuffle (bool, default=False) – Shuffle examples before splitting into folds for CV.

  • save_cv_folds (bool, default=True) – Whether to save the CV fold IDs.

  • save_cv_models (bool, default=False) – Whether to save the CV models.

  • use_custom_folds_for_grid_search (bool, default=True) – If cv_folds is a custom dictionary, but grid_search_folds is not, perhaps due to user oversight, should the same custom dictionary automatically be used for the inner grid-search cross-validation?

Returns:

A 5-tuple containing the following:

List[skll.types.EvaluateTaskResults]: the confusion matrix, overall accuracy, per-label PRFs, model parameters, objective function score, and evaluation metrics (if any) for each fold.

List[float]: the grid search scores for each fold.

List[Dict[str, Any]]: list of dictionaries of grid search CV results, one per fold, with keys such as “params”, “mean_test_score”, etc, that are mapped to lists of values associated with each hyperparameter set combination.

Optional[skll.types.FoldMapping]: dictionary containing the test-fold number for each id if save_cv_folds is True, otherwise None.

Optional[List[skll.learner.Learner]]: list of learners, one for each fold if save_cv_models is True, otherwise None.

Return type:

skll.types.CrossValidateTaskResults

Raises:
  • ValueError – If classification labels are not properly encoded as strings.

  • ValueError – If grid_search is True but grid_objective is None.
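When cv_folds is a mapping from example IDs to folds rather than an integer, each distinct fold is held out once as the test fold. The sketch below shows one way such a mapping can be turned into train/test splits. It assumes the mapping is a plain dictionary and that folds are held out in sorted order; the function name is hypothetical and SKLL's own fold handling may differ.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def splits_from_fold_mapping(
    example_ids: Sequence[str], fold_mapping: Dict[str, str]
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (fold, train_ids, test_ids), holding out one fold at a time."""
    # Group the example IDs by the fold they are assigned to.
    by_fold = defaultdict(list)
    for example_id in example_ids:
        by_fold[fold_mapping[example_id]].append(example_id)

    # Each fold serves as the test set once; everything else is training data.
    for fold in sorted(by_fold):
        test_ids = by_fold[fold]
        train_ids = [i for i in example_ids if fold_mapping[i] != fold]
        yield fold, train_ids, test_ids


ids = ["ex1", "ex2", "ex3", "ex4"]
folds = {"ex1": "0", "ex2": "1", "ex3": "0", "ex4": "1"}
splits = list(splits_from_fold_mapping(ids, folds))
```

Here the examples mapped to fold "0" form the first test set and the examples mapped to fold "1" form the second, so the number of cross-validation rounds equals the number of distinct folds in the mapping.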

evaluate(examples, prediction_prefix=None, append=False, grid_objective=None, output_metrics=[])[source]

Evaluate the learner on a given dev or test FeatureSet.

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to evaluate the performance of the model on.

  • prediction_prefix (Optional[str], default=None) – If not None, predictions will also be written out to a file with the name <prediction_prefix>_predictions.tsv. Note that the prefix can also contain a path.

  • append (bool, default=False) – Should we append the current predictions to the file if it exists?

  • grid_objective (Optional[str], default=None) – The objective function that was used when doing the grid search.

  • output_metrics (List[str], default=[]) – List of additional metric names to compute in addition to grid objective.

Returns:

A 6-tuple containing the confusion matrix, the overall accuracy, the per-label PRFs, the model parameters, the grid search objective function score, and the additional evaluation metrics, if any. For regressors, the first two elements in the tuple are None.

Return type:

skll.types.EvaluateTaskResults

classmethod from_file(learner_path, logger=None)[source]

Load a saved Learner instance from a file path.

Parameters:
  • learner_path (skll.types.PathOrStr) – The path to a saved Learner instance file.

  • logger (Optional[logging.Logger], default=None) – A logging object. If None is passed, get logger from __name__.

Returns:

The Learner instance loaded from the file.

Return type:

skll.learner.Learner

get_feature_names_out()[source]

Return the names of the actual features used by the estimator.

It is possible for some features to get filtered out by the feature selector which means that the vectorizer is no longer the correct source for the feature names. This method takes into account the feature selector and returns the names of the features that were actually selected to be used by the estimator.

Returns:

names – Names of features actually used by the estimator.

Return type:

numpy.ndarray of shape (num_features,)

Raises:

ValueError – If self.feat_vectorizer is either None or a sklearn.feature_extraction.FeatureHasher.
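To see why the vectorizer alone may not give the right feature names, consider the min_feature_count filter from the constructor: a feature that is nonzero in too few examples is dropped before the estimator sees it. The following sketch illustrates that idea on dictionaries of feature values. The function names are hypothetical, the sorted output order is chosen only to make the result deterministic, and none of this is SKLL's actual selector code.

```python
from typing import Dict, List, Sequence


def count_nonzero_per_feature(examples: Sequence[Dict[str, float]]) -> Dict[str, int]:
    """Count, for each feature, the number of examples with a nonzero value."""
    counts: Dict[str, int] = {}
    for features in examples:
        for name, value in features.items():
            if value != 0:
                counts[name] = counts.get(name, 0) + 1
    return counts


def selected_feature_names(
    examples: Sequence[Dict[str, float]], min_feature_count: int = 1
) -> List[str]:
    """Return names of features that are nonzero in enough examples to be kept."""
    counts = count_nonzero_per_feature(examples)
    return sorted(name for name, n in counts.items() if n >= min_feature_count)


examples = [
    {"a": 1, "b": 1},
    {"a": 2, "c": 1},
    {"a": 0, "c": 3},
]
kept_default = selected_feature_names(examples)
kept_strict = selected_feature_names(examples, min_feature_count=2)
```

With the default threshold all three features survive, while min_feature_count=2 removes "b", which is nonzero in only one example.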

learning_curve(examples, metric, cv_folds=10, train_sizes=array([0.1, 0.325, 0.55, 0.775, 1.]), override_minimum=False)[source]

Generate learning curves for the learner using the examples.

The learning curves are generated on the training examples via cross-validation. Adapted from the scikit-learn code for learning curve generation (cf. sklearn.model_selection.learning_curve).

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to generate the learning curve on.

  • cv_folds (Union[int, skll.types.FoldMapping], default=10) – The number of folds to use for cross-validation, or a mapping from example IDs to folds.

  • metric (str) – The name of the metric function to use when computing the train and test scores for the learning curve.

  • train_sizes (skll.types.LearningCurveSizes, default= numpy.linspace() with start=0.1, stop=1.0, num=5) – Relative or absolute numbers of training examples that will be used to generate the learning curve. If the type is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.

  • override_minimum (bool, default=False) – Learning curves can be unreliable for very small sizes, especially for more than 2 labels. If this option is set to True, the learning curve is generated even if the number of examples is less than 500, along with a warning. If False, the curve is not generated and an exception is raised instead.

Returns:

  • train_scores (List[float]) – The scores for the training set.

  • test_scores (List[float]) – The scores on the test set.

  • fit_times (List[float]) – The average times taken to fit each model.

  • num_examples (List[int]) – The numbers of training examples used to generate the curve.

Raises:

ValueError – If the number of examples is less than 500 and override_minimum is False.

Return type:

Tuple[List[float], List[float], List[float], List[int]]
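The interpretation of train_sizes and the 500-example minimum can be illustrated without any dependencies. The sketch below is a simplified model of the rules stated in the parameter descriptions: the function names are hypothetical, and truncating fractional sizes to integers is an assumption rather than a documented guarantee.

```python
import warnings
from typing import List, Sequence, Union


def default_train_sizes(start: float = 0.1, stop: float = 1.0, num: int = 5) -> List[float]:
    """Evenly spaced fractions, equivalent in spirit to numpy.linspace(start, stop, num)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def resolve_train_sizes(
    train_sizes: Sequence[Union[int, float]], max_train_size: int
) -> List[int]:
    """Turn relative (float) or absolute (int) sizes into absolute example counts."""
    resolved = []
    for size in train_sizes:
        if isinstance(size, float):
            if not 0.0 < size <= 1.0:
                raise ValueError(f"Fractional size must be within (0, 1]: {size}")
            # Assumption: fractional sizes are truncated, but never below 1.
            absolute = max(1, int(size * max_train_size))
        else:
            if not 0 < size <= max_train_size:
                raise ValueError(f"Absolute size out of range: {size}")
            absolute = size
        resolved.append(absolute)
    return resolved


def check_minimum_examples(num_examples: int, override_minimum: bool = False) -> None:
    """Raise for fewer than 500 examples unless the minimum is overridden."""
    if num_examples >= 500:
        return
    if not override_minimum:
        raise ValueError("Learning curves need at least 500 examples")
    warnings.warn("Learning curve may be unreliable with fewer than 500 examples")
```

For instance, resolve_train_sizes([0.5, 1.0], 200) yields the absolute sizes 100 and 200, while integer inputs are passed through after a range check.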

load(learner_path)[source]

Replace the current learner instance with a saved learner.

Parameters:

learner_path (skll.types.PathOrStr) – The path to a saved learner object file to load.

Return type:

None

property model

Return the underlying scikit-learn model.

property model_kwargs: Dict[str, Any]

Return a dictionary of the underlying scikit-learn model’s keyword arguments.

property model_params: Tuple[Dict[str, Any], Dict[str, Any]]

Return model parameters (i.e., weights).

Return the weights for a LinearModel (e.g., Ridge), regression, and liblinear models. If the model was trained using feature hashing, then names of the form hashed_feature_XX are used instead.

Returns:

  • res (Dict[str, Any]) – A dictionary of labeled weights.

  • intercept (Dict[str, Any]) – A dictionary of intercept(s).

Raises:

ValueError – If the instance does not support model parameters.

property model_type

Return the model type (i.e., the class).

predict(examples, prediction_prefix=None, append=False, class_labels=True)[source]

Generate predictions for the given examples using the learner model.

Return, and optionally, write out predictions on a given FeatureSet to a file. For regressors, the returned and written-out predictions are identical. However, for classifiers:

  • if class_labels is True, class labels are returned as well as written out.

  • if class_labels is False and the classifier is probabilistic (i.e., self.probability is True), class probabilities are returned as well as written out.

  • if class_labels is False and the classifier is non-probabilistic (i.e., self.probability is False), class indices are returned and class labels are written out.

TL;DR: for regressors, just ignore class_labels. For classifiers, set it to True to get class labels and False to get class probabilities.

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to predict labels for.

  • prediction_prefix (Optional[str], default=None) – If not None, predictions will also be written out to a file with the name <prediction_prefix>_predictions.tsv. For classifiers, the predictions written out are class labels unless the learner is probabilistic AND class_labels is set to False. Note that this prefix can also contain a path.

  • append (bool, default=False) – Should we append the current predictions to the file if it exists?

  • class_labels (bool, default=True) – If False, return either the class probabilities (probabilistic classifiers) or the class indices (non-probabilistic ones). If True, return the class labels no matter what. Ignored for regressors.

Returns:

The predictions returned by the Learner instance.

Return type:

numpy.ndarray

Raises:
  • AssertionError – If invalid predictions are being returned or written out.

  • MemoryError – If process runs out of memory when converting to dense.

  • RuntimeError – If there is a mismatch between the learner vectorizer and the test set vectorizer.
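The rules for what predict() returns versus what it writes to disk can be summarized as a small decision function. This is a restatement of the cases listed in the method description, with hypothetical function names and string tags; it does not call SKLL.

```python
from typing import Tuple


def prediction_outputs(
    is_classifier: bool, probability: bool, class_labels: bool
) -> Tuple[str, str]:
    """Return (kind of values returned, kind of values written out)."""
    if not is_classifier:
        # Regressors: class_labels is ignored; both outputs are identical.
        return ("values", "values")
    if class_labels:
        return ("labels", "labels")
    if probability:
        return ("probabilities", "probabilities")
    # Non-probabilistic classifier with class_labels=False.
    return ("indices", "labels")


def predictions_path(prediction_prefix: str) -> str:
    """Build the output filename from a prefix, which may include a path."""
    return f"{prediction_prefix}_predictions.tsv"
```

The only case in which the returned and written values differ is a non-probabilistic classifier with class_labels set to False.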

property probability: bool

Return the value of the probability flag.

The flag indicates whether the learner returns probabilities of all labels (instead of just the label with the highest probability).

save(learner_path)[source]

Save the Learner instance to a file.

Parameters:

learner_path (skll.types.PathOrStr) – The path to save the Learner instance to.

Return type:

None

train(examples, param_grid=None, grid_search_folds=5, grid_search=True, grid_objective=None, grid_jobs=None, shuffle=False)[source]

Train model underlying the learner.

Return the grid search score and a dictionary of grid search results.

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to use for training.

  • param_grid (Optional[Dict[str, Any]], default=None) – The parameter grid to search through for grid search. If None, a default parameter grid will be used.

  • grid_search_folds (Union[int, skll.types.FoldMapping], default=5) – The number of folds to use when doing the grid search, or a mapping from example IDs to folds.

  • grid_search (bool, default=True) – Should we do grid search?

  • grid_objective (Optional[str], default=None) – The name of the objective function to use when doing the grid search. Must be specified if grid_search is True.

  • grid_jobs (Optional[int], default=None) – The number of jobs to run in parallel when doing the grid search. If None or 0, the number of grid search folds will be used.

  • shuffle (bool, default=False) – Shuffle examples (e.g., for grid search CV.)

Returns:

  • float – The best grid search objective function score, or 0 if we’re not doing grid search

  • Dict[str, Any] – Dictionary of grid search CV results with keys such as “params”, “mean_test_score”, etc, that are mapped to lists of values associated with each hyperparameter set combination, or None if not doing grid search.

Raises:
  • ValueError – If grid_objective is not a valid grid objective or if one is not specified when necessary.

  • MemoryError – If process runs out of memory converting training data to dense.

  • ValueError – If FeatureHasher is used with MultinomialNB.

Return type:

Tuple[float, Dict[str, Any]]
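Two of the argument rules above lend themselves to a short sketch: grid_objective is required whenever grid_search is True, and grid_jobs falls back to the number of grid search folds when it is None or 0. The helper names below are hypothetical. Counting the distinct fold values of a mapping as "the number of folds" is an assumption made for this illustration.

```python
from typing import Dict, Optional, Union


def check_grid_objective(grid_search: bool, grid_objective: Optional[str]) -> None:
    """Raise if grid search is requested without an objective."""
    if grid_search and grid_objective is None:
        raise ValueError("grid_objective must be specified when grid_search is True")


def resolve_grid_jobs(
    grid_jobs: Optional[int], grid_search_folds: Union[int, Dict[str, str]]
) -> int:
    """Return the number of parallel jobs to use for grid search."""
    if isinstance(grid_search_folds, int):
        num_folds = grid_search_folds
    else:
        # Assumption: a fold mapping has as many folds as distinct fold values.
        num_folds = len(set(grid_search_folds.values()))
    # Both None and 0 mean "use the number of folds".
    return grid_jobs if grid_jobs else num_folds
```

Passing grid_search=False skips the objective check entirely, which corresponds to the case where train() reports a grid search score of 0.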

skll.learner.load_custom_learner(custom_learner_path, custom_learner_name)[source]

Import and load the custom learner object from the given path.

Parameters:
  • custom_learner_path (skll.types.PathOrStr) – The path to a custom learner.

  • custom_learner_name (str) – The name of a custom learner.

Returns:

The SKLL learner object loaded from the given path.

Return type:

skll.learner.Learner

Raises:

ValueError – If the custom learner path does not end in ‘.py’.
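Loading a class by name from a standalone Python file is commonly done with importlib. The sketch below shows that general technique together with the ‘.py’ check mentioned above. The function name load_class_from_file is hypothetical and this is not SKLL's own loader, which may register or validate the learner differently.

```python
import importlib.util
from pathlib import Path
from typing import Union


def load_class_from_file(module_path: Union[str, Path], class_name: str) -> type:
    """Import the module at module_path and return the named class from it."""
    module_path = Path(module_path)
    if module_path.suffix != ".py":
        raise ValueError(f"Custom learner path must end in '.py': {module_path}")

    # Build a module spec from the file location and execute the module.
    spec = importlib.util.spec_from_file_location(module_path.stem, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # Look up the class by name; raises AttributeError if it is not defined.
    return getattr(module, class_name)
```

The suffix check runs before any import is attempted, so a path with the wrong extension fails fast with a ValueError.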

VotingLearner Class

A meta-learner class that wraps scikit-learn’s VotingClassifier and VotingRegressor.

author:

Nitin Madnani (nmadnani@ets.org)

organization:

ETS

class skll.learner.voting.VotingLearner(learner_names, voting='hard', custom_learner_path=None, feature_scaling='none', pos_label=None, min_feature_count=1, model_kwargs_list=None, sampler_list=None, sampler_kwargs_list=None, logger=None)[source]

Bases: object

Wrap VotingClassifier and VotingRegressor from scikit-learn.

Note that this class does not inherit from the Learner class; instead, it uses separate Learner instances internally.

Parameters:
  • learner_names (List[str]) – List of the learner names that will participate in the voting process.

  • voting (Optional[str], default="hard") – One of “hard” or “soft”. If “hard”, the predicted class labels are used for majority rule voting. If “soft”, the predicted class label is based on the argmax of the sums of the predicted probabilities from each of the underlying learners. This parameter is only relevant for classification.

  • custom_learner_path (Optional[skll.types.PathOrStr], default=None) – Path to a Python file containing the definitions of any custom learners. Any and all custom learners in learner_names must be defined in this file. If the custom learner does not inherit from an already existing scikit-learn estimator, it must explicitly define an _estimator_type attribute indicating whether it’s a “classifier” or a “regressor”.

  • feature_scaling (str, default="none") – How to scale the features, if at all, for each estimator. Options are:

      - “with_std”: scale features using the standard deviation
      - “with_mean”: center features using the mean
      - “both”: do both scaling as well as centering
      - “none”: do neither scaling nor centering

  • pos_label (Optional[skll.types.LabelType], default=None) – A string denoting the label of the class to be treated as the positive class in a binary classification setting, for each estimator. If None, the class represented by the label that appears second when sorted is chosen as the positive class. For example, if the two labels in data are “A” and “B” and pos_label is not specified, “B” will be chosen as the positive class.

  • min_feature_count (int, default=1) – The minimum number of examples a feature must have a nonzero value in to be included, for each estimator.

  • model_kwargs_list (Optional[List[Dict[str, Any]]], default=None) – A list of dictionaries of keyword arguments to pass to the initializer for each of the estimators. There’s a one-to-one correspondence between the order of this list and the order of the learner_names list.

  • sampler_list (Optional[List[str]], default=None) – The samplers to use for kernel approximation, if desired, for each estimator. Valid values are:

      - “AdditiveChi2Sampler”
      - “Nystroem”
      - “RBFSampler”
      - “SkewedChi2Sampler”

    There’s a one-to-one correspondence between the order of this list and the order of the learner_names list.

  • sampler_kwargs_list (Optional[List[Dict[str, Any]]], default=None) – A list of dictionaries of keyword arguments to pass to the initializer for the specified sampler, one per estimator. There’s a one-to-one correspondence between the order of this list and the order of the learner_names list.

  • logger (Optional[logging.Logger], default=None) – A logging object. If None is passed, get logger from __name__.
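The difference between “hard” and “soft” voting is easiest to see on a single example. The sketch below implements both rules in plain Python for one instance. The function names are hypothetical, and breaking ties in favor of the label that sorts first is an assumption of this sketch, not something stated in this documentation.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predicted_labels: Sequence[str]) -> str:
    """Majority rule over the labels predicted by each underlying learner."""
    counts = Counter(predicted_labels)
    best = max(counts.values())
    # Assumption: ties go to the label that sorts first.
    return min(label for label, n in counts.items() if n == best)


def soft_vote(classes: Sequence[str], probabilities: Sequence[Sequence[float]]) -> str:
    """Argmax of the per-class sums of the learners' predicted probabilities."""
    sums: List[float] = [
        sum(learner_probs[i] for learner_probs in probabilities)
        for i in range(len(classes))
    ]
    return classes[sums.index(max(sums))]


classes = ["A", "B"]
# One row per underlying learner: probabilities for "A" and "B".
probabilities = [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]]
# Each learner's own top label, as used by hard voting.
top_labels = [classes[row.index(max(row))] for row in probabilities]

hard_result = hard_vote(top_labels)
soft_result = soft_vote(classes, probabilities)
```

In this example two of the three learners narrowly prefer "B", so hard voting picks "B", while the one confident learner pushes the summed probabilities toward "A" under soft voting.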

cross_validate(examples, stratified=True, cv_folds=10, cv_seed=123456789, grid_search=True, grid_search_folds=5, grid_jobs=None, grid_objective=None, output_metrics=[], prediction_prefix=None, param_grid_list=None, shuffle=False, save_cv_folds=True, save_cv_models=False, individual_predictions=False, use_custom_folds_for_grid_search=True)[source]

Cross-validate the meta-estimator on the given examples.

We follow essentially the same methodology as in Learner.cross_validate() - split the examples into training and testing folds, and then call self.train() on the training folds and then self.evaluate() on the test fold. Note that this means that underlying estimators with different hyperparameters may be used for each fold, as is the case with Learner.cross_validate().

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to cross-validate learner performance on.

  • stratified (bool, default=True) – Should we stratify the folds to ensure an even distribution of labels for each fold?

  • cv_folds (Union[int, skll.types.FoldMapping], default=10) – The number of folds to use for cross-validation, or a mapping from example IDs to folds.

  • cv_seed (int, default=123456789) – The value for seeding the random number generator used to create the random folds. Note that this seed is only used if either grid_search or shuffle is set to True.

  • grid_search (bool, default=True) – Should we do grid search when training each fold? Note that this makes cross-validation take much longer.

  • grid_search_folds (Union[int, skll.types.FoldMapping], default=5) – The number of folds to use when doing the grid search, or a mapping from example IDs to folds.

  • grid_jobs (Optional[int], default=None) – The number of jobs to run in parallel when doing the grid search. If None or 0, the number of grid search folds will be used.

  • grid_objective (Optional[str], default=None) – The name of the objective function to use when doing the grid search. Must be specified if grid_search is True.

  • output_metrics (Optional[List[str]], default=[]) – List of additional metric names to compute in addition to the metric used for grid search.

  • prediction_prefix (Optional[str], default=None) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by "_predictions.tsv".

  • param_grid_list (Optional[List[Dict[str, Any]]], default=None) – The list of parameter grids to search through for grid search, one for each underlying learner. The order of the dictionaries should correspond to the order in which the underlying estimators were specified when the VotingLearner was instantiated. If None, the default parameter grids will be used for the underlying estimators.

  • shuffle (bool, default=False) – Shuffle examples before splitting into folds for CV.

  • save_cv_folds (bool, default=True) – Whether to save the CV fold IDs.

  • save_cv_models (bool, default=False) – Whether to save the CV models.

  • individual_predictions (bool, default=False) – Write out the cross-validated predictions from each underlying learner as well.

  • use_custom_folds_for_grid_search (bool, default=True) – If cv_folds is a custom dictionary, but grid_search_folds is not, perhaps due to user oversight, should the same custom dictionary automatically be used for the inner grid-search cross-validation?

Returns:

A 3-tuple containing the following:

List[skll.types.EvaluateTaskResults]: the confusion matrix, overall accuracy, per-label PRFs, model parameters, objective function score, and evaluation metrics (if any) for each fold.

Optional[skll.types.FoldMapping]: dictionary containing the test-fold number for each id if save_cv_folds is True, otherwise None.

Optional[List[skll.learner.voting.VotingLearner]]: list of voting learners, one for each fold if save_cv_models is True, otherwise None.

Return type:

skll.types.CrossValidateTaskResults

Raises:
  • ValueError – If classification labels are not properly encoded as strings.

  • ValueError – If grid_search is True but grid_objective is None.

evaluate(examples, prediction_prefix=None, append=False, grid_objective=None, individual_predictions=False, output_metrics=[])[source]

Evaluate the meta-estimator on a given FeatureSet.

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to evaluate the performance of the model on.

  • prediction_prefix (Optional[str], default=None) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by "_predictions.tsv".

  • append (bool, default=False) – Should we append the current predictions to the file if it exists?

  • grid_objective (Optional[str], default=None) – The objective function used when doing the grid search.

  • individual_predictions (bool, default=False) – Optionally, write out the predictions from each underlying learner.

  • output_metrics (List[str], default=[]) – List of additional metric names to compute in addition to grid objective.

Returns:

The confusion matrix, the overall accuracy, the per-label PRFs, the model parameters, the grid search objective function score, and the additional evaluation metrics, if any.

Return type:

skll.types.EvaluateTaskResults

classmethod from_file(learner_path, logger=None)[source]

Load a saved VotingLearner instance from a file.

Parameters:
  • learner_path (skll.types.PathOrStr) – The path to a saved VotingLearner instance file.

  • logger (Optional[logging.Logger], default=None) – A logging object. If None is passed, get logger from __name__.

Returns:

learner – The VotingLearner instance loaded from the file.

Return type:

skll.learner.voting.VotingLearner

property learners: List[Learner]

Return the underlying list of learners.

learning_curve(examples, metric, cv_folds=10, train_sizes=array([0.1, 0.325, 0.55, 0.775, 1.]), override_minimum=False)[source]

Generate learning curves for the meta-estimator.

Generate learning curves for the voting meta-estimator on the training examples via cross-validation. Adapted from the scikit-learn code for learning curve generation (cf. sklearn.model_selection.learning_curve).

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to generate the learning curve on.

  • metric (str) – The name of the metric function to use when computing the train and test scores for the learning curve.

  • cv_folds (Union[int, skll.types.FoldMapping], default=10) – The number of folds to use for cross-validation, or a mapping from example IDs to folds.

  • train_sizes (skll.types.LearningCurveSizes, default= numpy.linspace() with start=0.1, stop=1.0, num=5) – Relative or absolute numbers of training examples that will be used to generate the learning curve. If the type is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.

  • override_minimum (bool, default=False) – Learning curves can be unreliable for very small sizes, especially for more than 2 labels. If this option is set to True, the learning curve is generated even if the number of examples is less than 500, along with a warning. If False, the curve is not generated and an exception is raised instead.

Returns:

  • train_scores (List[float]) – The scores for the training set.

  • test_scores (List[float]) – The scores on the test set.

  • fit_times (List[float]) – The average times taken to fit each model.

  • num_examples (List[int]) – The numbers of training examples used to generate the curve.

Raises:

ValueError – If the number of examples is less than 500 and override_minimum is False.

Return type:

Tuple[List[float], List[float], List[float], List[int]]

property model

Return underlying scikit-learn meta-estimator model.

property model_type

Return meta-estimator model type (i.e., the class).

predict(examples, prediction_prefix=None, append=False, class_labels=True, individual_predictions=False)[source]

Generate predictions with meta-estimator.

Compute the predictions from the meta-estimator and, optionally, the underlying estimators on given FeatureSet. The predictions are also written to disk if prediction_prefix is not None.

For regressors, the returned and written-out predictions are identical. However, for classifiers:

  • if class_labels is True, class labels are returned as well as written out.

  • if class_labels is False and the classifier is probabilistic (i.e., self.probability is True), class probabilities are returned as well as written out.

  • if class_labels is False and the classifier is non-probabilistic (i.e., self.probability is False), class indices are returned and class labels are written out. This option is generally only meant for SKLL-internal use.

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to predict labels for.

  • prediction_prefix (Optional[str], default=None) – If saving the predictions, this is the prefix that will be used for the filename. It will be followed by "_predictions.tsv".

  • append (bool, default=False) – Should we append the current predictions to the file if it exists?

  • class_labels (bool, default=True) – For classifiers, should we convert class indices to their (str) labels for the returned array? Note that class labels are always written out to disk.

  • individual_predictions (bool, default=False) – Return (and, optionally, write out) the predictions from each underlying learner.

Returns:

The first element is the array of predictions returned by the meta-estimator and the second is an optional dictionary with the name of each underlying learner as the key and the array of its predictions as the value. The second element is None if individual_predictions is set to False.

Return type:

Tuple[numpy.ndarray, Optional[Dict[str, numpy.ndarray]]]

save(learner_path)[source]

Save the VotingLearner instance to a file.

Parameters:

learner_path (skll.types.PathOrStr) – The path to save the VotingLearner instance to.

Return type:

None

train(examples, param_grid_list=None, grid_search_folds=5, grid_search=True, grid_objective=None, grid_jobs=None, shuffle=False)[source]

Train the voting meta-estimator.

First, we train each of the underlying estimators (represented by a skll Learner), possibly with grid search. Then, we instantiate a VotingClassifier or VotingRegressor as appropriate with the scikit-learn Pipeline stored in the pipeline attribute of each trained Learner instance as the estimator. Finally, we call fit() on the VotingClassifier or VotingRegressor instance. We follow this process because it allows us to use grid search to find good hyperparameter values for our underlying learners before passing them to the meta-estimator AND because it allows us to use SKLL featuresets and do all of the same pre-processing when doing inference.

The trained meta-estimator is saved in the _model attribute. Nothing is returned.

Parameters:
  • examples (skll.data.featureset.FeatureSet) – The FeatureSet instance to use for training.

  • param_grid_list (Optional[List[Dict[str, Any]]], default=None) – The list of parameter grids to search through for grid search, one for each underlying learner. The order of the dictionaries should correspond to the order in which the underlying estimators were specified when the VotingLearner was instantiated. If None, the default parameter grids will be used for the underlying estimators.

  • grid_search_folds (Union[int, skll.types.FoldMapping], default=5) – The number of folds to use when doing the grid search for each of the underlying learners, or a mapping from example IDs to folds.

  • grid_search (bool, default=True) – Should we use grid search when training each underlying learner?

  • grid_objective (Optional[str], default=None) – The name of the objective function to use when doing the grid search for each underlying learner. Must be specified if grid_search is True.

  • grid_jobs (Optional[int], default=None) – The number of jobs to run in parallel when doing the grid search for each underlying learner. If None or 0, the number of grid search folds will be used.

  • shuffle (bool, default=False) – Shuffle examples (e.g., for grid search CV.)

Return type:

None