Optimizing tree ensembles

Example of hyperparameter optimization of a tree-based classifier model.

When optimizing a model that will be run on an embedded device, we usually want to optimize not just the predictive performance (given by our metric, say accuracy) but also the computational costs of the model (in terms of storage, memory and CPU requirements).

emlearn provides tools for analyzing model costs in the emlearn.evaluate submodule.
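
These cost metrics can also be called directly on a fitted model. Below is a minimal sketch, assuming the scorer-style signature (model, X, y) implied by how the functions are passed to RandomizedSearchCV later in this example:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

from emlearn.evaluate.trees import model_size_bytes, compute_cost_estimate

# toy data, just to have a fitted ensemble to inspect
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=1).fit(X, y)

# estimated storage and compute cost of the ensemble
print('size (bytes):', model_size_bytes(model, X, y))
print('compute cost:', compute_cost_estimate(model, X, y))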

This is an example of how to do that, by optimizing hyperparameters using random search, and finding the models that represent good performance/cost trade-offs (Pareto optimal). The search optimizes the two main things that influence performance and cost: the number of decision nodes in the trees (the depth), and the number of trees in the ensemble (the “breadth” of the model).

This method is simple and a good starting point for a broad search of possible models. However, if you have a large dataset, consider subsampling the training set to speed up the search, for example as sketched below.
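
With the pandas DataFrame data loaded below, a random subsample could be drawn before running the search (the fraction is only illustrative):

# draw a random 25% subsample to speed up the hyper-parameter search
data_small = data.sample(frac=0.25, random_state=1)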

import os.path

import emlearn
import numpy
import pandas
import seaborn
import matplotlib.pyplot as plt

try:
    # When executed as regular .py script
    here = os.path.dirname(__file__)
except NameError:
    # When executed as Jupyter notebook / Sphinx Gallery
    here = os.getcwd()


from emlearn.examples.datasets.sonar import load_sonar_dataset, tidy_sonar_data

Load dataset

The Sonar dataset is a basic binary-classification dataset, included with scikit-learn. Each instance contains the energy across multiple frequency bands (a spectrum).

data = load_sonar_dataset()
tidy = tidy_sonar_data(data)
tidy.head(3)
             label  energy
sample band
0      0      rock  0.0200
1      0      rock  0.0453
2      0      rock  0.0262


Visualize data

Looking at the overall plot, the data looks easily separable by class. But plotting each sample shows that there is considerable intra-class variability.

seaborn.relplot(data=tidy, kind='line', x='band', y='energy', hue='label',
        height=3, aspect=3)


seaborn.relplot(data=tidy, kind='line', x='band', y='energy', hue='sample',
            row='label', errorbar=None, aspect=3, height=3, legend=False);

[Figures: mean energy per frequency band by label, and per-sample energy curves faceted by label (metal, rock)]

Setup model evaluation and optimization

Using RandomizedSearchCV from scikit-learn for random search of hyperparameters. In addition to a standard accuracy metric to estimate the predictive performance of the model, we use two functions from emlearn.evaluate.trees to estimate the storage and compute requirements of the model. The outputs will be used to identify models that offer a good tradeoff between predictive performance and compute costs (Pareto optimal).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import sklearn.model_selection
import sklearn.metrics

# custom metrics for model costs
from emlearn.evaluate.trees import model_size_bytes, compute_cost_estimate


def evaluate_classifier(model, data, features=None, cut_top=None):
    spectrum_columns = [c for c in data.columns if c.startswith('b.')]

    if features is None:
        features = spectrum_columns
    if cut_top is not None:
        features = features[:-cut_top]

    # minimally prepare dataset
    X = data[features]
    y = data['label']
    X = X.astype('float32')
    y = LabelEncoder().fit_transform(y.astype('str'))
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

    # perform the search
    model.fit(X_train, y_train)
    # summarize
    y_hat = model.predict(X_test)
    acc = accuracy_score(y_test, y_hat)
    print("Accuracy: %.3f" % acc)

def build_hyperparameter_optimizer(hyperparameters={}, cv=10, n_iter=100, n_jobs=-1, verbose=1):

    search = sklearn.model_selection.RandomizedSearchCV(
        RandomForestClassifier(n_jobs=1),
        param_distributions=hyperparameters,
        scoring={
            # our predictive model metric
            'accuracy': sklearn.metrics.make_scorer(sklearn.metrics.accuracy_score),

            # metrics for the model costs
            'size': model_size_bytes,
            'compute': compute_cost_estimate,
        },
        refit='accuracy',
        n_iter=n_iter,
        cv=cv,
        return_train_score=True,
        n_jobs=n_jobs,
        verbose=verbose,
    )

    return search

Build a baseline model

Uses the default hyper-parameters for scikit-learn RandomForestClassifier, not searching any alternatives. This is our un-optimized reference point.

baseline = build_hyperparameter_optimizer(hyperparameters={}, n_iter=1) # default
evaluate_classifier(baseline, data)
baseline_results = pandas.DataFrame(baseline.cv_results_)
baseline_results[['mean_test_accuracy', 'mean_test_size', 'mean_test_compute']]
Fitting 10 folds for each of 1 candidates, totalling 10 fits
Accuracy: 0.826
   mean_test_accuracy  mean_test_size  mean_test_compute
0            0.813187         12790.8         537.493407


Optimize hyper-parameters

The number of trees in the decision forest is optimized using the parameter n_estimators. Multiple strategies are shown for limiting the depth of the trees, using either max_depth, min_samples_leaf, min_samples_split or ccp_alpha.

To keep the example fast to run, we only try a limited number of hyper-parameter combinations (controlled by n_iterations). Better results are expected when increasing this by a factor of 10-100x.
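
To build intuition for how each of these parameters constrains tree size, one can compare the node counts of single decision trees fit with different settings. A minimal sketch on synthetic data (the values are only illustrative):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=1)

settings = [
    dict(),                         # unconstrained tree
    dict(max_depth=4),              # hard limit on depth
    dict(min_samples_leaf=0.05),    # each leaf must cover at least 5% of samples
    dict(min_samples_split=0.20),   # only split nodes covering at least 20% of samples
    dict(ccp_alpha=0.02),           # cost-complexity pruning after training
]
for params in settings:
    tree = DecisionTreeClassifier(random_state=1, **params).fit(X_demo, y_demo)
    print(params, 'nodes:', tree.tree_.node_count)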

import scipy.stats

def run_experiment(depth_param=None, n_iter=1):
    import time

    start_time = time.time()

    optimize_params = [
        'n_estimators',
        depth_param,
    ]

    print('Running experiment: ', depth_param)

    hyperparams = { k: v for k, v in parameter_distributions.items() if k in optimize_params }

    search = build_hyperparameter_optimizer(hyperparams, n_iter=n_iter, cv=5)
    evaluate_classifier(search, data)
    df = pandas.DataFrame(search.cv_results_)
    df['depth_param_type'] = depth_param
    df['depth_param_value'] = df[f'param_{depth_param}']

    end_time = time.time()
    duration = end_time - start_time
    print(f'Experiment took {duration:.2f} seconds', )

    df.to_csv(f'sonar-tuning-results-{depth_param}-{n_iter}.csv')

    return df

# Spaces to search for hyperparameters
parameter_distributions = {
    # regulates width of the ensemble
    'n_estimators': scipy.stats.randint(5, 100),

    # different alternatives for regulating the depth of the trees
    'max_depth': scipy.stats.randint(1, 10),
    'min_samples_leaf': scipy.stats.loguniform(0.01, 0.33),
    'min_samples_split': scipy.stats.loguniform(0.01, 0.60),
    'ccp_alpha': scipy.stats.uniform(0.01, 0.20),
}

# Experiments to run, using different ways of constraining tree depth
depth_limiting_parameters = [
    'max_depth',
    'min_samples_leaf',
    'min_samples_split',
    'ccp_alpha',
]

# Number of hyper-parameter combinations to sample in each experiment
n_iterations = int(os.environ.get('EMLEARN_HYPER_ITERATIONS', 1*100))

results = pandas.concat([ run_experiment(p, n_iter=n_iterations) for p in depth_limiting_parameters ])
results.sort_values('mean_test_accuracy', ascending=False).head(10)[['mean_test_accuracy', 'mean_test_size', 'mean_test_compute']]
Running experiment:  max_depth
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Accuracy: 0.812
Experiment took 73.51 seconds
Running experiment:  min_samples_leaf
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Accuracy: 0.797
Experiment took 74.89 seconds
Running experiment:  min_samples_split
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Accuracy: 0.812
Experiment took 79.04 seconds
Running experiment:  ccp_alpha
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Accuracy: 0.754
Experiment took 66.55 seconds
    mean_test_accuracy  mean_test_size  mean_test_compute
91            0.849735          5704.2         308.289683
82            0.842063          5932.8         270.977513
88            0.835450          6991.2         313.406349
34            0.835185          7077.6         315.496296
41            0.835185          8938.8         444.747619
77            0.828307          6629.4         300.236772
3             0.828307          5610.6         248.760317
98            0.828307          3697.2         162.626455
79            0.828307          7102.8         353.077513
87            0.828307         10344.6         482.714815


Check effect of depth parameter

The different values of the hyperparameter affecting tree depth influence the regularization considerably. One can see that for certain values there is overfitting (train accuracy near 100%, far above test accuracy), and for other values there is underfitting (train accuracy close to test accuracy, and test accuracy dropping). This means that our search space is wide enough to cover the relevant area.

Note that n_estimators is also varied and affects the results, but is not visualized here.
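
One way to quantify this from the cross-validation results collected above is the gap between mean train and test accuracy per candidate, for instance:

# a large gap suggests overfitting; a small gap combined with low test
# accuracy suggests underfitting
results['accuracy_gap'] = results['mean_train_accuracy'] - results['mean_test_accuracy']
results.groupby('depth_param_type')['accuracy_gap'].describe()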

def add_performance_references(ax):
    ax.axhline(0.55, ls='--', alpha=0.5, color='black')
    ax.axhline(0.85, ls='--', alpha=0.5, color='green')
    ax.axhline(0.80, ls='--', alpha=0.5, color='orange')

def plot_scores(data, color=None, metric='accuracy', s=10):
    ax = plt.gca()
    x = 'param_' + data['depth_param_type'].unique()[0]

    seaborn.scatterplot(data=data, x=x, y=f'mean_test_{metric}', alpha=0.9, color='blue', label='test', s=s)
    seaborn.scatterplot(data=data, x=x, y=f'mean_train_{metric}', alpha=0.5, color='grey', label='train', s=s)

    add_performance_references(ax)
    ax.set_ylim(0.5, 1.05)
    ax.set_ylabel('accuracy')
    ax.set_title('')
    ax.set_xlabel('')

g = seaborn.FacetGrid(results, col='depth_param_type', col_wrap=2, height=3, aspect=2.5, sharex=False)
g.map_dataframe(plot_scores, s=15);

[Figure: train and test accuracy versus depth parameter value, one panel per depth_param_type]

Trade-off between predictive performance and model costs

There will always be a trade-off between how well a model performs and what the model costs. But it is only worth considering the models that, for a given performance level, have the lowest cost. The models that fulfill this are said to lie on the Pareto front, or to be Pareto optimal.
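
Conceptually, a candidate is Pareto optimal if no other candidate is at least as good on both objectives and strictly better on one of them. emlearn.evaluate.pareto (used below) provides ready-made helpers; a plain-pandas sketch of the check over the metric columns from the search results could look like:

def is_pareto_optimal(df, performance='mean_test_accuracy', cost='mean_test_compute'):
    # a candidate is dominated if another candidate is at least as good on both
    # objectives and strictly better on at least one of them
    def dominated(row):
        at_least_as_good = (df[performance] >= row[performance]) & (df[cost] <= row[cost])
        strictly_better = (df[performance] > row[performance]) | (df[cost] < row[cost])
        return (at_least_as_good & strictly_better).any()
    return ~df.apply(dominated, axis=1)

# example: keep only the Pareto-optimal candidates from the search results
# pareto_candidates = results[is_pareto_optimal(results)]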

In the plot below, the primary model cost axis is the (estimated) compute time, shown along the X axis. Model size is considered secondary, and is visualized using the size of the data points. The Pareto-optimal models, for each depth-limiting strategy, are highlighted with lines.

from emlearn.evaluate.pareto import plot_pareto_front, find_pareto_front

def plot_pareto(results, x='mean_test_compute', **kwargs):
    g = plot_pareto_front(results, cost_metric=x, pareto_global=True, pareto_cut=0.7, **kwargs)
    ax = plt.gca()
    add_performance_references(ax)
    ax.legend()

# sphinx_gallery_thumbnail_number = 4
plot_pareto(results, hue='depth_param_type', height=4, aspect=2.5)

[Figure: Pareto front of test accuracy versus estimated compute cost, colored by depth_param_type]

Summarize Pareto-optimal models

Compared to the baseline, the optimized models are many times smaller and execute many times faster, while matching or exceeding performance.

pareto = find_pareto_front(results, min_performance=0.7)
ref = baseline_results.iloc[0]
rel = pandas.concat([pareto, baseline_results])
# compute performance relative to baseline
rel['accuracy'] = (rel['mean_test_accuracy'] - ref['mean_test_accuracy']) * 100
rel['compute'] = ref['mean_test_compute'] / rel['mean_test_compute']
rel['size'] = ref['mean_test_size'] / rel['mean_test_size']
rel = rel.sort_values('accuracy', ascending=False).round(2).reset_index()
rel[['accuracy', 'compute', 'size']]
    accuracy  compute    size
0       3.65     1.74    2.24
1       2.89     1.98    2.16
2       1.51     3.31    3.46
3       0.77     7.47    8.35
4       0.00     1.00    1.00
5      -1.37     8.90   13.01
6      -1.40    15.54   19.31
7      -3.54    15.54   36.82
8      -3.59    17.88   31.72
9      -4.28    17.89   70.36
10     -5.00    18.74   47.37
11     -6.45    42.04  187.00
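
As a possible next step (not part of this example), a model picked from the Pareto front could be refit and converted to C with emlearn.convert, for inclusion in firmware. A brief sketch, with hypothetical hyper-parameter values and file names:

# hypothetical: refit with hyper-parameters picked from the Pareto front,
# using training data prepared as in evaluate_classifier()
chosen = RandomForestClassifier(n_estimators=20, min_samples_leaf=0.05)
chosen.fit(X_train, y_train)

# convert to C and save as a header file for inclusion in firmware
cmodel = emlearn.convert(chosen, method='inline')
cmodel.save(file='sonar_model.h', name='sonar_model')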


Total running time of the script: (5 minutes 6.075 seconds)
