Tutorial
Before doing anything below, you’ll want to install SKLL.
Workflow
In general, there are four steps to using SKLL:
Get some data in a SKLL-compatible format.
Create a small configuration file describing the machine learning experiment you would like to run.
Run that configuration file with run_experiment.
Examine the results of the experiment.
Titanic Example
Let’s see how we can apply the basic workflow above to a simple example using the Titanic: Machine Learning from Disaster data from Kaggle.
Create virtual environment with SKLL
Before we proceed further, we need to install SKLL. The easiest way to do this is in a virtual environment. For this tutorial, we will use conda for creating our virtual environment as follows:
conda create -n skllenv -c conda-forge -c ets python=3.11 skll
This will create a new virtual environment named skllenv
with the
latest release of SKLL which you can then activate by running
conda activate skllenv
. Make sure to create and activate this environment
before proceeding further. Once you are done with the tutorial, you may
deactivate the virtual environment by running conda deactivate
.
Get your data into the correct format
The first step is to get the Titanic data. We have already downloaded the data files from Kaggle and included them in the SKLL repository. Next, we need to get the files and process them to get them in the right shape.
The provided script, make_titanic_example_data.py
, will split the train and test data files
from Kaggle up into groups of related features and store them in
dev
, test
, train
, and train+dev
subdirectories.
The development set that gets created by the script is 20% of the data
that was in the original training set, and train
contains the other 80%.
Create a configuration file for the experiment
For this tutorial, we will refer to an “experiment” as having a single data set split into training and testing portions. As part of each experiment, we can train and test several models, either simultaneously or sequentially, depending whether we’re using GridMap or not. This will be described in more detail later on, when we are ready to run our experiment.
You can consult the full list of learners currently available in SKLL to get an idea for the things you can do. As part of this tutorial, we will use the following classifiers:
Decision Tree
Multinomial Naïve Bayes
Random Forest
Support Vector Machine
[General]
experiment_name = Titanic_Evaluate_Tuned
task = evaluate
[Input]
# this could also be an absolute path instead (and must be if you're not
# running things in local mode)
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
label_col = Survived
id_col = PassengerId
[Tuning]
grid_search = true
grid_search_folds = 3
objectives = ['accuracy']
[Output]
# again, these can be absolute paths
metrics = ['roc_auc']
probability = true
logs = output
results = output
predictions = output
models = output
Let’s take a look at the options specified in titanic/evaluate_tuned.cfg
.
Here, we are only going to train a model and evaluate its performance on the
development set, because in the General section, task is set to
evaluate. We will explore the other options for task
later.
In the Input section, we have specified relative paths to the training
and testing directories via the train_directory and
test_directory settings respectively.
featuresets indicates the name of both the training and
testing files. learners must always be specified in between [
]
brackets, even if you only want to use one learner. This is similar to the
featuresets option, which requires two sets of brackets,
since multiple sets of different-yet-related features can be provided. We will
keep our examples simple, however, and only use one set of features per
experiment. The label_col and id_col
settings specify the columns in the CSV files that specify the class labels and
instances IDs for each example.
The Tuning section defines how we want our model to be tuned. Setting
grid_search to True
here employs scikit-learn’s
GridSearchCV
class, which is an implementation of the
standard, brute-force approach to hyperparameter optimization.
objectives refers to the desired objective functions; here,
accuracy
will optimize for overall accuracy. You can see a list of all the
available objective functions here.
In the Output section, we first define the additional evaluation metrics we want to compute in addition to the tuning objective via the metrics option. The other options are directories where you’d like all of the relevant output from your experiment to go. results refers to the results of the experiment in both human-readable and JSON forms. logs specifies where to put log files containing any status, warning, or error messages generated during model training and evaluation. predictions refers to where to store the individual predictions generated for the test set. models is for specifying a directory to serialize the trained models.
Running your configuration file through run_experiment
Getting your experiment running is the simplest part of using SKLL, you just need to type the following into a terminal:
$ run_experiment titanic/evaluate_tuned.cfg
Make sure you have the skllenv
environment activated before you run
this command which should produce output like:
2020-03-10 14:25:23,596 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier - INFO - Task: evaluate
2020-03-10 14:25:23,596 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier - INFO - Training on train, Test on dev, feature set ['family.csv', 'misc.csv', 'socioeconomic.csv', 'vitals.csv'] ...
Loading /Users/nmadnani/work/skll/examples/titanic/train/family.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/train/misc.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/train/socioeconomic.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/train/vitals.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/dev/family.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/dev/misc.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/dev/socioeconomic.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/dev/vitals.csv... done
2020-03-10 14:25:23,662 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier - INFO - Featurizing and training new RandomForestClassifier model
2020-03-10 14:25:23,663 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier - WARNING - Training data will be shuffled to randomize grid search folds. Shuffling may yield different results compared to scikit-learn.
2020-03-10 14:25:28,129 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier - INFO - Best accuracy grid search score: 0.798
2020-03-10 14:25:28,130 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier - INFO - Hyperparameters: bootstrap: True, ccp_alpha: 0.0, class_weight: None, criterion: gini, max_depth: 5, max_features: auto, max_leaf_nodes: None, max_samples: None, min_impurity_decrease: 0.0, min_impurity_split: None, min_samples_leaf: 1, min_samples_split: 2, min_weight_fraction_leaf: 0.0, n_estimators: 500, n_jobs: None, oob_score: False, random_state: 123456789, verbose: 0, warm_start: False
2020-03-10 14:25:28,130 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier - INFO - Evaluating predictions
2020-03-10 14:25:28,172 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier - INFO - using probabilities for the positive class to compute "roc_auc" for evaluation.
2020-03-10 14:25:28,178 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier - INFO - Task: evaluate
2020-03-10 14:25:28,178 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier - INFO - Training on train, Test on dev, feature set ['family.csv', 'misc.csv', 'socioeconomic.csv', 'vitals.csv'] ...
Loading /Users/nmadnani/work/skll/examples/titanic/train/family.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/train/misc.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/train/socioeconomic.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/train/vitals.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/dev/family.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/dev/misc.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/dev/socioeconomic.csv... done
Loading /Users/nmadnani/work/skll/examples/titanic/dev/vitals.csv... done
2020-03-10 14:25:28,226 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier - INFO - Featurizing and training new DecisionTreeClassifier model
2020-03-10 14:25:28,226 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier - WARNING - Training data will be shuffled to randomize grid search folds. Shuffling may yield different results compared to scikit-learn.
2020-03-10 14:25:28,269 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier - INFO - Best accuracy grid search score: 0.754
2020-03-10 14:25:28,269 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier - INFO - Hyperparameters: ccp_alpha: 0.0, class_weight: None, criterion: gini, max_depth: None, max_features: None, max_leaf_nodes: None, min_impurity_decrease: 0.0, min_impurity_split: None, min_samples_leaf: 1, min_samples_split: 2, min_weight_fraction_leaf: 0.0, presort: deprecated, random_state: 123456789, splitter: best
2020-03-10 14:25:28,269 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier - INFO - Evaluating predictions
2020-03-10 14:25:28,272 - Titanic_Evaluate_Tuned_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier - INFO - using probabilities for the positive class to compute "roc_auc" for evaluation.
We could squelch the warnings about shuffling by setting
shuffle to True
in the Input section.
The reason we see the loading messages repeated is that we are running the different learners sequentially, whereas SKLL is designed to take advantage of a cluster to execute everything in parallel via GridMap.
Examine the results
As a result of running our experiment, there will be a whole host of files in our results directory. They can be broken down into three types of files:
.results
files, which contain a human-readable summary of the experiment, complete with confusion matrix..results.json
files, which contain all of the same information as the.results
files, but in a format more well-suited to automated processing.A summary
.tsv
file, which contains all of the information in all of the.results.json
files with one line per file. This is very nice if you’re trying many different learners and want to compare their performance. If you do additional experiments later (with a different config file), but would like one giant summary file, you can use the summarize_results command.
An example of a human-readable results file for our Titanic experiment is:
Experiment Name: Titanic_Evaluate_Tuned
SKLL Version: 2.1
Training Set: train
Training Set Size: 569
Test Set: dev
Test Set Size: 143
Shuffle: False
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Task: evaluate
Feature Scaling: none
Grid Search: True
Grid Search Folds: 3
Grid Objective Function: accuracy
Additional Evaluation Metrics: ['roc_auc']
Scikit-learn Version: 0.22.2.post1
Start Timestamp: 10 Mar 2020 14:25:23.595787
End Timestamp: 10 Mar 2020 14:25:28.175375
Total Time: 0:00:04.579588
Fold:
Model Parameters: {"bootstrap": true, "ccp_alpha": 0.0, "class_weight": null, "criterion": "gini", "max_depth": 5, "max_features": "auto", "max_leaf_nodes": null, "max_samples": null, "min_impurity_decrease": 0.0, "min_impurity_split": null, "min_samples_leaf": 1, "min_samples_split": 2, "min_weight_fraction_leaf": 0.0, "n_estimators": 500, "n_jobs": null, "oob_score": false, "random_state": 123456789, "verbose": 0, "warm_start": false}
Grid Objective Score (Train) = 0.797874315418175
+----+------+------+-------------+----------+-------------+
| | 0 | 1 | Precision | Recall | F-measure |
+====+======+======+=============+==========+=============+
| 0 | [79] | 8 | 0.849 | 0.908 | 0.878 |
+----+------+------+-------------+----------+-------------+
| 1 | 14 | [42] | 0.840 | 0.750 | 0.792 |
+----+------+------+-------------+----------+-------------+
(row = reference; column = predicted)
Accuracy = 0.8461538461538461
Objective Function Score (Test) = 0.8461538461538461
Additional Evaluation Metrics (Test):
roc_auc = 0.9224137931034483
IRIS Example on Binder
If you prefer using an interactive Jupyter notebook to learn about SKLL, you can do so by clicking the launch button below.