Optimizing tree ensembles

Example of hyperparameter-optimization of a tree-based classifier model.

When optimizing a model that will be run on an embedded device, we usually want to optimize not just the predictive performance (given by our metric, say accuracy) but also the computational costs of the model (in terms of storage, memory and CPU requirements).

emlearn provides tools for analyzing model costs in the emlearn.evaluate submodule.

This is an example of how to do that, by optimizing hyperparameters using random search, and finding the models that represent good performance/cost trade-offs (Pareto optimal). The search optimizes the two main things that influence performance and cost: the number of decision nodes in the trees (the depth), and the number of trees in the ensemble (the “breadth” of the model).

This method is simple and a good starting point for a broad search of possible models. However, if you have a large dataset, consider subsampling the training set to speed up the search.
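As a sketch of that subsampling idea (the `subsample` helper and the toy DataFrame are hypothetical, not part of the example), a random fraction of rows can be drawn before running the search:

```python
import pandas

# Hypothetical helper: subsample a DataFrame before hyperparameter search
def subsample(data: pandas.DataFrame, fraction=0.5, seed=1):
    # Draw a random fraction of rows; a fixed seed keeps the search reproducible
    return data.sample(frac=fraction, random_state=seed)

df = pandas.DataFrame({'x': range(100), 'label': [0, 1] * 50})
small = subsample(df, fraction=0.2)  # 20 of the 100 rows
```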

import os.path

import emlearn
import numpy
import pandas
import seaborn
import matplotlib.pyplot as plt

try:
    # When executed as regular .py script
    here = os.path.dirname(__file__)
except NameError:
    # When executed as Jupyter notebook / Sphinx Gallery
    here = os.getcwd()


from emlearn.examples.datasets.sonar import load_sonar_dataset, tidy_sonar_data

Load dataset

The Sonar dataset is a basic binary-classification dataset, bundled with the emlearn examples. Each instance contains the energy across multiple frequency bands (a spectrum).

data = load_sonar_dataset()
tidy = tidy_sonar_data(data)
tidy.head(3)
             label  energy
sample band
0      0      rock  0.0200
1      0      rock  0.0453
2      0      rock  0.0262


Visualize data

Looking at the overall plot, the data looks easily separable by class. But plotting each sample shows that there is considerable intra-class variability.
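One way to quantify that intra-class variability is the spread of energy within each (label, band) group. The toy frame below is hypothetical, but its columns mirror the tidy frame used in this example:

```python
import pandas

# Hypothetical tidy data with the same columns as the Sonar tidy frame
tidy = pandas.DataFrame({
    'sample': [0, 0, 1, 1, 2, 2, 3, 3],
    'band':   [0, 1, 0, 1, 0, 1, 0, 1],
    'label':  ['rock', 'rock', 'rock', 'rock', 'metal', 'metal', 'metal', 'metal'],
    'energy': [0.02, 0.04, 0.03, 0.05, 0.20, 0.40, 0.22, 0.44],
})

# Standard deviation of energy across samples, per class and band:
# large values indicate high within-class variability
spread = tidy.groupby(['label', 'band'])['energy'].std()
```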

seaborn.relplot(data=tidy, kind='line', x='band', y='energy', hue='label',
        height=3, aspect=3)


seaborn.relplot(data=tidy, kind='line', x='band', y='energy', hue='sample',
            row='label', errorbar=None, aspect=3, height=3, legend=False);
(Figure: energy per band, one line per sample, in two panels: label = metal and label = rock)

Setup model evaluation and optimization

Using RandomizedSearchCV from scikit-learn for random search of hyperparameters. In addition to a standard accuracy metric to estimate the predictive performance of the model, we use two functions from emlearn.evaluate.trees to estimate the storage and compute requirements of the model. The outputs will be used to identify models that offer a good tradeoff between predictive performance and compute costs (Pareto optimal).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
import sklearn.model_selection

# custom metrics for model costs
from emlearn.evaluate.trees import model_size_bytes, compute_cost_estimate


def evaluate_classifier(model, data, features=None, cut_top=None):
    spectrum_columns = [c for c in data.columns if c.startswith('b.')]

    if features is None:
        features = spectrum_columns
    if cut_top is not None:
        features = features[:-cut_top]

    # minimally prepare dataset
    X = data[features]
    y = data['label']
    X = X.astype('float32')
    y = LabelEncoder().fit_transform(y.astype('str'))
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

    # perform the search
    model.fit(X_train, y_train)
    # summarize
    y_hat = model.predict(X_test)
    acc = accuracy_score(y_test, y_hat)
    print("Accuracy: %.3f" % acc)

def build_hyperparameter_optimizer(hyperparameters={}, cv=10, n_iter=100, n_jobs=-1, verbose=1):

    search = sklearn.model_selection.RandomizedSearchCV(
        RandomForestClassifier(n_jobs=1),
        param_distributions=hyperparameters,
        scoring={
            # our predictive model metric
            'accuracy': sklearn.metrics.make_scorer(sklearn.metrics.accuracy_score),

            # metrics for the model costs
            'size': model_size_bytes,
            'compute': compute_cost_estimate,
        },
        refit='accuracy',
        n_iter=n_iter,
        cv=cv,
        return_train_score=True,
        n_jobs=n_jobs,
        verbose=verbose,
    )

    return search

Build a baseline model

Uses the default hyper-parameters for scikit-learn RandomForestClassifier, not searching any alternatives. This is our un-optimized reference point.

baseline = build_hyperparameter_optimizer(hyperparameters={}, n_iter=1) # default
evaluate_classifier(baseline, data)
baseline_results = pandas.DataFrame(baseline.cv_results_)
baseline_results[['mean_test_accuracy', 'mean_test_size', 'mean_test_compute']]
Fitting 10 folds for each of 1 candidates, totalling 10 fits
Accuracy: 0.797
   mean_test_accuracy  mean_test_size  mean_test_compute
0            0.820879         12695.4         535.135165


Optimize hyper-parameters

The number of trees in the decision forest is optimized using the parameter n_estimators. Multiple strategies are shown for limiting the depth of the trees, using either max_depth, min_samples_leaf, min_samples_split or ccp_alpha.

To keep the example fast to run, we only try a limited number of hyper-parameter combinations (controlled by n_iterations). Better results are expected when increasing this by a factor of 10-100x.
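As a rough illustration (separate from the example itself) of how these parameters constrain tree size, one can fit single decision trees on synthetic data and inspect `tree_.node_count`:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Each parameter below is an alternative way to limit tree depth/size
for params in [{'max_depth': 3}, {'min_samples_leaf': 0.10}, {'ccp_alpha': 0.02}]:
    tree = DecisionTreeClassifier(random_state=1, **params).fit(X, y)
    print(params, 'nodes:', tree.tree_.node_count)

# A depth-limited tree is a truncated version of the unconstrained tree,
# so it can never have more nodes
unconstrained = DecisionTreeClassifier(random_state=1).fit(X, y)
constrained = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
```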

import scipy.stats

def run_experiment(depth_param=None, n_iter=1):
    import time

    start_time = time.time()

    optimize_params = [
        'n_estimators',
        depth_param,
    ]

    print('Running experiment: ', depth_param)

    hyperparams = { k: v for k, v in parameter_distributions.items() if k in optimize_params }

    search = build_hyperparameter_optimizer(hyperparams, n_iter=n_iter, cv=5)
    evaluate_classifier(search, data)
    df = pandas.DataFrame(search.cv_results_)
    df['depth_param_type'] = depth_param
    df['depth_param_value'] = df[f'param_{depth_param}']

    end_time = time.time()
    duration = end_time - start_time
    print(f'Experiment took {duration:.2f} seconds', )

    df.to_csv(f'sonar-tuning-results-{depth_param}-{n_iter}.csv')

    return df

# Spaces to search for hyperparameters
parameter_distributions = {
    # regulates width of the ensemble
    'n_estimators': scipy.stats.randint(5, 100),

    # different alternatives for regulating depth of the trees
    'max_depth': scipy.stats.randint(1, 10),
    'min_samples_leaf': scipy.stats.loguniform(0.01, 0.33),
    'min_samples_split': scipy.stats.loguniform(0.01, 0.60),
    'ccp_alpha': scipy.stats.uniform(0.01, 0.20),
}

# Experiments to run, using different ways of constraining tree depth
depth_limiting_parameters = [
    'max_depth',
    'min_samples_leaf',
    'min_samples_split',
    'ccp_alpha',
]

# Number of samples to try for different hyper-parameters
n_iterations = int(os.environ.get('EMLEARN_HYPER_ITERATIONS', 100))

results = pandas.concat([ run_experiment(p, n_iter=n_iterations) for p in depth_limiting_parameters ])
results.sort_values('mean_test_accuracy', ascending=False).head(10)[['mean_test_accuracy', 'mean_test_size', 'mean_test_compute']]
Running experiment:  max_depth
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Accuracy: 0.754
Experiment took 63.73 seconds
Running experiment:  min_samples_leaf
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Accuracy: 0.797
Experiment took 64.93 seconds
Running experiment:  min_samples_split
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Accuracy: 0.797
Experiment took 60.58 seconds
Running experiment:  ccp_alpha
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Accuracy: 0.783
Experiment took 60.44 seconds
    mean_test_accuracy  mean_test_size  mean_test_compute
43            0.849735          6211.8         305.891270
84            0.842593          9770.4         440.708730
99            0.835185          9075.6         412.647090
71            0.835185         10891.8         490.412434
65            0.828307          4768.2         232.656085
34            0.828307          1681.2         131.587302
29            0.828307         11134.8         502.488889
50            0.828307         10872.0         499.126455
23            0.828307          9129.6         412.700794
54            0.828042          2077.2         102.477778


Check effect of depth parameter

The different values of the hyperparameter affecting tree depth influence the regularization considerably. One can see that for certain values there is overfitting (train accuracy near 100%, far above test), and for other values there is underfitting (train accuracy near test, and test dropping). This means that our search space is wide enough to cover the relevant area.

Note that n_estimators is also varied and affects the results, but is not visualized here.
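The over-/underfitting pattern can also be inspected numerically rather than visually. A minimal sketch, using a toy frame whose columns follow scikit-learn's `cv_results_` naming (the values and the 0.10 threshold are arbitrary illustrations):

```python
import pandas

# Toy cv_results_-style table: train/test accuracy per candidate
results = pandas.DataFrame({
    'param_max_depth':     [1,    4,    10],
    'mean_train_accuracy': [0.70, 0.90, 1.00],
    'mean_test_accuracy':  [0.68, 0.85, 0.80],
})

# Large train-test gap suggests overfitting
results['gap'] = results['mean_train_accuracy'] - results['mean_test_accuracy']
overfit = results[results['gap'] > 0.10]  # threshold is an arbitrary choice
```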

def add_performance_references(ax):
    ax.axhline(0.55, ls='--', alpha=0.5, color='black')
    ax.axhline(0.85, ls='--', alpha=0.5, color='green')
    ax.axhline(0.80, ls='--', alpha=0.5, color='orange')

def plot_scores(data, color=None, metric='accuracy', s=10):
    ax = plt.gca()
    x = 'param_' + data['depth_param_type'].unique()[0]

    seaborn.scatterplot(data=data, x=x, y=f'mean_test_{metric}', alpha=0.9, color='blue', label='test', s=s)
    seaborn.scatterplot(data=data, x=x, y=f'mean_train_{metric}', alpha=0.5, color='grey', label='train', s=s)

    add_performance_references(ax)
    ax.set_ylim(0.5, 1.05)
    ax.set_ylabel('accuracy')
    ax.set_title('')
    ax.set_xlabel('')

g = seaborn.FacetGrid(results, col='depth_param_type', col_wrap=2, height=3, aspect=2.5, sharex=False)
g.map_dataframe(plot_scores, s=15);
(Figure: train/test accuracy versus each depth-limiting hyperparameter, one panel per parameter)

Trade-off between predictive performance and model costs

There will always be a trade-off between how well a model does, and the costs of the model. But it is only worth considering the models that for a given performance level, have better (lower) cost. The models that fulfill this are said to lie on the Pareto front, or that they are Pareto optimal.
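The idea can be sketched in a few lines (this is a simplified illustration, not emlearn's `find_pareto_front` implementation): sort models by cost, and keep only those that strictly improve on every cheaper model.

```python
import pandas

def pareto_front(df, cost='compute', score='accuracy'):
    # Walk models from cheapest to most expensive; keep a model only if it
    # scores strictly better than every cheaper model seen so far
    best = -float('inf')
    keep = []
    for idx, row in df.sort_values(cost).iterrows():
        if row[score] > best:
            keep.append(idx)
            best = row[score]
    return df.loc[keep]

# Hypothetical candidates: the 300-cost model is dominated by the 200-cost one
models = pandas.DataFrame({
    'compute':  [100, 200, 300, 400],
    'accuracy': [0.80, 0.85, 0.84, 0.90],
})
front = pareto_front(models)
```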

In the plot below, the primary model-cost axis is the (estimated) compute time, shown along the X axis. Model size is considered secondary, and is visualized using the size of the datapoints. The Pareto-optimal models, for each depth-limiting strategy, are highlighted with lines.

from emlearn.evaluate.pareto import plot_pareto_front, find_pareto_front

def plot_pareto(results, x='mean_test_compute', **kwargs):
    g = plot_pareto_front(results, cost_metric=x, pareto_global=True, pareto_cut=0.7, **kwargs)
    ax = plt.gca()
    add_performance_references(ax)
    ax.legend()

# sphinx_gallery_thumbnail_number = 4
plot_pareto(results, hue='depth_param_type', height=4, aspect=2.5)
(Figure: accuracy versus estimated compute cost, with Pareto-optimal models highlighted)

Summarize Pareto-optimal models

Compared to the baseline, the optimized models are many times smaller and execute many times faster, while matching or exceeding performance.

pareto = find_pareto_front(results, min_performance=0.7)
ref = baseline_results.iloc[0]
rel = pandas.concat([pareto, baseline_results])
# compute performance relative to baseline
rel['accuracy'] = (rel['mean_test_accuracy'] - ref['mean_test_accuracy']) * 100
rel['compute'] = ref['mean_test_compute'] / rel['mean_test_compute']
rel['size'] = ref['mean_test_size'] / rel['mean_test_size']
rel = rel.sort_values('accuracy', ascending=False).round(2).reset_index()
rel[['accuracy', 'compute', 'size']]
    accuracy  compute    size
0       2.89     1.75    2.04
1       0.74     4.07    7.55
2       0.72     5.22    6.11
3       0.00     5.30    5.71
4       0.00     1.00    1.00
5      -1.40     6.04    6.97
6      -1.43     6.32   17.20
7      -2.11     6.58   11.38
8      -2.86    20.74   22.32
9      -5.79    23.05   29.27
10     -7.91    24.32  128.24
11     -7.93    59.46  201.51


Total running time of the script: (4 minutes 20.893 seconds)
