.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/trees_hyperparameters.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_trees_hyperparameters.py: Optimizing model size for trees =========================== Example of hyperparameter-optimization of a tree-based classifier model. When optimizing a model that will be ran on an embedded device, we usually want to optimize not just the predictive performance (given by our metric, say accuracy) but also the computational costs of the model (in terms of storage, memory and CPU requirements). emlearn provides tools for analyzing model costs in the `emlearn.evaluate` submodule. This is an example of how to do that, by optimizing hyperparamters using random search, and finding the models that represent good performance/cost trade-offs (Pareto optimimal). The search optimizes the two main things that influence performance and cost: the number of decision nodes in the trees (the depth), and the number of trees in the ensemble (the "breath" of the model). This method is simple and a good starting point for a broad search of possible models. However if you have a large dataset, consider reducing subsampling the training-set to speed up search. .. GENERATED FROM PYTHON SOURCE LINES 26-45 .. code-block:: Python import os.path import emlearn import numpy import pandas import seaborn import matplotlib.pyplot as plt try: # When executed as regular .py script here = os.path.dirname(__file__) except NameError: # When executed as Jupyter notebook / Sphinx Gallery here = os.getcwd() from emlearn.examples.datasets.sonar import load_sonar_dataset, tidy_sonar_data .. GENERATED FROM PYTHON SOURCE LINES 46-51 Load dataset ------------------------ The Sonar dataset is a basic binary-classification dataset, included with scikit-learn Each instance constains the energy across multiple frequency bands (a spectrum). .. GENERATED FROM PYTHON SOURCE LINES 51-57 .. code-block:: Python data = load_sonar_dataset() tidy = tidy_sonar_data(data) tidy.head(3) .. raw:: html
label energy
sample band
0 0 rock 0.0200
1 0 rock 0.0453
2 0 rock 0.0262


.. GENERATED FROM PYTHON SOURCE LINES 58-63 Visualize data ------------------------ Looking at the overall plot, the data looks easily separable by class. But plotting each sample shows that there is considerable intra-class variability. .. GENERATED FROM PYTHON SOURCE LINES 63-73 .. code-block:: Python seaborn.relplot(data=tidy, kind='line', x='band', y='energy', hue='label', height=3, aspect=3) seaborn.relplot(data=tidy, kind='line', x='band', y='energy', hue='sample', row='label', ci=None, aspect=3, height=3, legend=False); .. rst-class:: sphx-glr-horizontal * .. image-sg:: /auto_examples/images/sphx_glr_trees_hyperparameters_001.png :alt: trees hyperparameters :srcset: /auto_examples/images/sphx_glr_trees_hyperparameters_001.png :class: sphx-glr-multi-img * .. image-sg:: /auto_examples/images/sphx_glr_trees_hyperparameters_002.png :alt: label = metal, label = rock :srcset: /auto_examples/images/sphx_glr_trees_hyperparameters_002.png :class: sphx-glr-multi-img .. rst-class:: sphx-glr-script-out .. code-block:: none /home/docs/checkouts/readthedocs.org/user_builds/emlearn/envs/stable/lib/python3.12/site-packages/seaborn/axisgrid.py:854: FutureWarning: The `ci` parameter is deprecated. Use `errorbar=None` for the same effect. func(*plot_args, **plot_kwargs) /home/docs/checkouts/readthedocs.org/user_builds/emlearn/envs/stable/lib/python3.12/site-packages/seaborn/axisgrid.py:854: FutureWarning: The `ci` parameter is deprecated. Use `errorbar=None` for the same effect. func(*plot_args, **plot_kwargs) .. GENERATED FROM PYTHON SOURCE LINES 74-81 Setup model evaluation and optimization ------------------------ Using RandomizedSearchCV from scikit-learn for random search of hyperparameters. In addition to a standard *accuracy* metric to estimate the predictive performance of the model, we use two functions from `emlearn.evaluate.trees` to estimate the storage and compute requirements of the model. The outputs will be used to identify models that offer a good tradeoff between predictive performance and compute costs (Pareto optimal). .. GENERATED FROM PYTHON SOURCE LINES 81-140 .. code-block:: Python from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelEncoder from sklearn.metrics import accuracy_score import sklearn.model_selection # custom metrics for model costs from emlearn.evaluate.trees import model_size_bytes, compute_cost_estimate def evaluate_classifier(model, data, features=None, cut_top=None): spectrum_columns = [c for c in data.columns if c.startswith('b.')] if features is None: features = spectrum_columns if cut_top is not None: features = features[:-cut_top] # minimally prepare dataset X = data[features] y = data['label'] X = X.astype('float32') y = LabelEncoder().fit_transform(y.astype('str')) # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # perform the search model.fit(X_train, y_train) # summarize y_hat = model.predict(X_test) acc = accuracy_score(y_test, y_hat) print("Accuracy: %.3f" % acc) def build_hyperparameter_optimizer(hyperparameters={}, cv=10, n_iter=100, n_jobs=-1, verbose=1): search = sklearn.model_selection.RandomizedSearchCV( RandomForestClassifier(n_jobs=1), param_distributions=hyperparameters, scoring={ # our predictive model metric 'accuracy': sklearn.metrics.make_scorer(sklearn.metrics.accuracy_score), # metrics for the model costs 'size': model_size_bytes, 'compute': compute_cost_estimate, }, refit='accuracy', n_iter=n_iter, cv=cv, return_train_score=True, n_jobs=n_jobs, verbose=verbose, ) return search .. GENERATED FROM PYTHON SOURCE LINES 141-146 Build a baseline model ------------------------ Uses the default hyper-parameters for scikit-learn RandomForestClassifier, not searching any alternatives. This is our un-optimized reference point. .. GENERATED FROM PYTHON SOURCE LINES 146-153 .. code-block:: Python baseline = build_hyperparameter_optimizer(hyperparameters={}, n_iter=1) # default evaluate_classifier(baseline, data) baseline_results = pandas.DataFrame(baseline.cv_results_) baseline_results[['mean_test_accuracy', 'mean_test_size', 'mean_test_compute']] .. rst-class:: sphx-glr-script-out .. code-block:: none Fitting 10 folds for each of 1 candidates, totalling 10 fits Accuracy: 0.783 .. raw:: html
mean_test_accuracy mean_test_size mean_test_compute
0 0.820879 11284.0 534.874176


.. GENERATED FROM PYTHON SOURCE LINES 154-164 Optimize hyper-parameters ------------------------ The number of trees in the decision forest is optimized using the parameter n_estimators. Multiple strategies are shown for limiting the depth of the trees, using *either* `max_depth`, `min_samples_leaf`, `min_samples_split` or `ccp_alpha`. To keep running the example fast, we only do a limited number of different hyper-parameters (controlled by `n_iterations`). Better results are expected if increasing this by factor 10-100x. .. GENERATED FROM PYTHON SOURCE LINES 164-222 .. code-block:: Python import scipy.stats def run_experiment(depth_param=None, n_iter=1): import time start_time = time.time() optimize_params = [ 'n_estimators', depth_param, ] print('Running experiment: ', depth_param) hyperparams = { k: v for k, v in parameter_distributions.items() if k in optimize_params } search = build_hyperparameter_optimizer(hyperparams, n_iter=n_iter, cv=5) evaluate_classifier(search, data) df = pandas.DataFrame(search.cv_results_) df['depth_param_type'] = depth_param df['depth_param_value'] = f'param_{depth_param}' end_time = time.time() duration = end_time - start_time print(f'Experiment took {duration:.2f} seconds', ) df.to_csv(f'sonar-tuning-results-{depth_param}-{n_iter}.csv') return df # Spaces to search for hyperparameters parameter_distributions = { # regulates width of the ensemble 'n_estimators': scipy.stats.randint(5, 100), # different alternatives to regulates depth of the trees 'max_depth': scipy.stats.randint(1, 10), 'min_samples_leaf': scipy.stats.loguniform(0.01, 0.33), 'min_samples_split': scipy.stats.loguniform(0.01, 0.60), 'ccp_alpha': scipy.stats.uniform(0.01, 0.20), } # Experiments to run, using different ways of constraining tree depth depth_limiting_parameters = [ 'max_depth', 'min_samples_leaf', 'min_samples_split', 'ccp_alpha', ] # Number of samples to try for different hyper-parameters n_iterations = int(os.environ.get('EMLEARN_HYPER_ITERATIONS', 1*100)) results = pandas.concat([ run_experiment(p, n_iter=n_iterations) for p in depth_limiting_parameters ]) results.sort_values('mean_test_accuracy', ascending=False).head(10)[['mean_test_accuracy', 'mean_test_size', 'mean_test_compute']] .. rst-class:: sphx-glr-script-out .. code-block:: none Running experiment: max_depth Fitting 5 folds for each of 100 candidates, totalling 500 fits Accuracy: 0.768 Experiment took 25.86 seconds Running experiment: min_samples_leaf Fitting 5 folds for each of 100 candidates, totalling 500 fits Accuracy: 0.841 Experiment took 24.13 seconds Running experiment: min_samples_split Fitting 5 folds for each of 100 candidates, totalling 500 fits Accuracy: 0.797 Experiment took 26.70 seconds Running experiment: ccp_alpha Fitting 5 folds for each of 100 candidates, totalling 500 fits Accuracy: 0.783 Experiment took 20.40 seconds .. raw:: html
mean_test_accuracy mean_test_size mean_test_compute
73 0.835450 3804.8 210.739683
73 0.835185 1987.2 105.047619
23 0.835185 8248.0 451.966931
88 0.835185 2737.6 153.498942
14 0.828307 7206.4 365.644180
46 0.828042 6257.6 321.045503
76 0.828042 2835.2 183.740741
39 0.828042 7072.0 359.139947
27 0.828042 7366.4 400.206614
42 0.828042 3331.2 244.037302


.. GENERATED FROM PYTHON SOURCE LINES 223-232 Check effect of depth parameter ------------------------------- The different values of the hyperparamterer affecting tree depth influence the regularization considerably. One can see that for certain values there is overfitting (train accuracy near 100%, far above test), and for other values there is a underfitting (train accuracy near test, and test dropping). This means that our search space is at wide enough to cover the relevant area. Note that `n_estimators` is also varied and affects the results, but is not visualized here. .. GENERATED FROM PYTHON SOURCE LINES 232-255 .. code-block:: Python def add_performance_references(ax): ax.axhline(0.55, ls='--', alpha=0.5, color='black') ax.axhline(0.85, ls='--', alpha=0.5, color='green') ax.axhline(0.80, ls='--', alpha=0.5, color='orange') def plot_scores(data, color=None, metric='accuracy', s=10): ax = plt.gca() x = 'param_' + data['depth_param_type'].unique()[0] seaborn.scatterplot(data=data, x=x, y=f'mean_test_{metric}', alpha=0.9, color='blue', label='test', s=s) seaborn.scatterplot(data=data, x=x, y=f'mean_train_{metric}', alpha=0.5, color='grey', label='train', s=s) add_performance_references(ax) ax.set_ylim(0.5, 1.05) ax.set_ylabel('accuracy') ax.set_title('') ax.set_xlabel('') g = seaborn.FacetGrid(results, col='depth_param_type', col_wrap=2, height=3, aspect=2.5, sharex=False) g.map_dataframe(plot_scores, s=15); .. image-sg:: /auto_examples/images/sphx_glr_trees_hyperparameters_003.png :alt: trees hyperparameters :srcset: /auto_examples/images/sphx_glr_trees_hyperparameters_003.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none .. GENERATED FROM PYTHON SOURCE LINES 256-267 Trade-off between predictive performance and model costs --------------------------------------------------------------- There will always be a trade-off between how well a model does, and the costs of the model. But it is only worth considering the models that for a given performance level, have better (lower) cost. The models that fulfill this are said to lie on the *Pareto front*, or that they are *Pareto optimal*. In the below plot: The primary model cost axis has been chosen as the (estimated) compute time, shown along the X axis. Model size is considered secondary, and is visualized using the size of the datapoints. The Pareto optimal models, for each depth limiting strategy, is highlighted with lines. .. GENERATED FROM PYTHON SOURCE LINES 267-280 .. code-block:: Python from emlearn.evaluate.pareto import plot_pareto_front, find_pareto_front def plot_pareto(results, x='mean_test_compute', **kwargs): g = plot_pareto_front(results, cost_metric=x, pareto_global=True, pareto_cut=0.7, **kwargs) ax = plt.gca() add_performance_references(ax) ax.legend() # sphinx_gallery_thumbnail_number = 4 plot_pareto(results, hue='depth_param_type', height=4, aspect=2.5) .. image-sg:: /auto_examples/images/sphx_glr_trees_hyperparameters_004.png :alt: trees hyperparameters :srcset: /auto_examples/images/sphx_glr_trees_hyperparameters_004.png :class: sphx-glr-single-img .. GENERATED FROM PYTHON SOURCE LINES 281-286 Summarize Pareto-optimal models ------------------------ Compared to the baseline, the optimized models are many times smaller and execute many times faster, while matching or exceeding performance. .. GENERATED FROM PYTHON SOURCE LINES 286-300 .. code-block:: Python pareto = find_pareto_front(results, min_performance=0.7) ref = baseline_results.iloc[0] rel = pareto.copy() rel = pandas.concat([pareto, baseline_results]) # compute performance relativeto baseline rel['accuracy'] = (rel['mean_test_accuracy'] - ref['mean_test_accuracy']) * 100 rel['compute'] = ref['mean_test_compute'] / rel['mean_test_compute'] rel['size'] = ref['mean_test_size'] / rel['mean_test_size'] rel = rel.sort_values('accuracy', ascending=False).round(2).reset_index() rel[['accuracy', 'compute', 'size']] .. raw:: html
accuracy compute size
0 1.46 2.54 2.97
1 1.43 5.09 5.68
2 0.03 6.52 7.00
3 0.00 6.65 7.37
4 0.00 1.00 1.00
5 -2.86 14.01 39.40
6 -3.60 22.14 30.27
7 -4.28 37.16 146.93
8 -8.65 62.19 201.50


.. rst-class:: sphx-glr-timing **Total running time of the script:** (1 minutes 42.662 seconds) .. _sphx_glr_download_auto_examples_trees_hyperparameters.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: trees_hyperparameters.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: trees_hyperparameters.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: trees_hyperparameters.zip ` .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_