13. Tree-based models

Tree-based models include Decision Trees, Random Forest, ExtraTrees, and so on. They are simple but very effective models that can handle high-dimensional, non-linear problems. As such, they are some of the most popular models found in emlearn.

For the general usage of tree ensembles, please refer to the ensemble documentation in scikit-learn. The documentation here covers topics specific to using tree-based models in compute-constrained environments (microcontrollers and embedded devices), and the techniques that emlearn implements to optimize for such usage.

13.1. Optimizing model complexity

The complexity of a tree-based ensemble is a function of its width (the number of trees) and the depth of the individual trees. This influences both the predictive performance and the computational costs of the model.

A larger model will generally have higher predictive performance, but needs more CPU/RAM/storage. This leads to a trade-off, and different applications may choose different operating points. We may try to find a set of Pareto-optimal model alternatives, as sketched below. This is illustrated in the example Optimizing tree ensembles.
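
A minimal sketch of such a sweep in Python, using scikit-learn. The dataset and hyperparameter grid are illustrative placeholders, and the total node count is used as a rough proxy for storage/CPU cost:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for n_trees in (5, 10, 50):
    for depth in (3, 5, 10):
        model = RandomForestClassifier(n_estimators=n_trees, max_depth=depth, random_state=1)
        model.fit(X_train, y_train)
        accuracy = model.score(X_test, y_test)
        # Total number of decision nodes: a rough proxy for storage and CPU cost
        nodes = sum(est.tree_.node_count for est in model.estimators_)
        print(f'trees={n_trees} depth={depth} accuracy={accuracy:.3f} nodes={nodes}')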

13.2. Inference strategy: Inline vs loadable

emlearn supports two different strategies for inference of tree-based models. These are called inline and loadable.

The loadable option uses an EmlTrees data-structure to store the decision tree ensemble. In addition to using a model from the C code generated by emlearn, this also supports loading the model from a file or building the decision tree in memory on-device.

The inline option generates C code statements directly. Each tree is a series of if-else statements, and the merging of the results from multiple trees is also generated code. This code has no dependencies on emlearn headers.

In general, the inline strategy is expected to have the fastest execution time. However, the exact impact on code space and execution time depends on the particular model, the target architecture and the compiler options, so it may need to be tested for your particular application.

The save() method outputs code for both the loadable and inline strategies.
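
For illustration, a header like the one used below could be generated from Python along these lines (a minimal sketch; the data, model and file names are placeholders):

import emlearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small example model on placeholder data
X, y = make_classification(n_samples=100, n_features=4, random_state=1)
model = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=1)
model.fit(X, y)

# Convert and save as a C header, used by the C code below
cmodel = emlearn.convert(model)
cmodel.save(name='mymodel', file='mymodel.h')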

// header generated with emlearn
#include "mymodel.h"

// use "inline" model for classification
const int cls = mymodel_predict(features, features_length);

// use "loadable" model for classification
const int cls = eml_trees_predict(mymodel, features, features_length);

// use "inline" model for regression
const float out = mymodel_predict(features, features_length);

// use "loadable" model for regression
const float out = eml_trees_regress1(mymodel, features, features_length);

NOTE: since both versions are present in the generated code, the compiler is responsible for eliminating the version that is not used. Make sure you are using suitable compiler options to enable such optimization.

The two strategies normally give identical results, but when combined with other optimizations (see below) they may differ slightly. When evaluating performance in Python, the method argument can be passed to emlearn.convert(). For example emlearn.convert(model, method='inline').
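
A minimal sketch of comparing the two strategies in Python (the data and model are illustrative placeholders; the converted models are evaluated with predict(), as in the emlearn examples):

import emlearn
import numpy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=1)
model = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=1)
model.fit(X, y)

inline = emlearn.convert(model, method='inline')
loadable = emlearn.convert(model, method='loadable')

# the two strategies should normally give identical predictions
print(numpy.array_equal(inline.predict(X), loadable.predict(X)))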

13.3. Optimization using feature quantization

The default feature representation in emlearn trees is float (32-bit floating point). However, the inline inference strategy also supports 8- and 16-bit integers.

Quite often it is acceptable to use lower precision for features, and this has multiple benefits:

  • Reduces the RAM space needed for features

  • Avoids using floating-point code, a big benefit when there is no hardware FPU

  • On 8-bit and 16-bit microprocessor architectures, the model may take up less code space

To use this feature, make sure all the input data is scaled to integers that fit in 8/16/32 bits, and set the dtype argument of emlearn.convert() to the appropriate C datatype. For example emlearn.convert(model, dtype='int8'). A complete example can be found in Feature data-type in tree-based models.
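
A minimal sketch, assuming the input features are already normalized to [-1, 1] (the scaling factor, data and names are illustrative):

import emlearn
import numpy
from sklearn.ensemble import RandomForestClassifier

# Placeholder data, already normalized to [-1, 1]
X = numpy.random.uniform(-1.0, 1.0, size=(100, 4))
y = numpy.random.randint(0, 2, size=100)

# Scale features into the int8 range and quantize
X_int8 = numpy.clip(X * 127, -128, 127).astype(numpy.int8)

model = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=1)
model.fit(X_int8, y)

cmodel = emlearn.convert(model, dtype='int8')
cmodel.save(name='mymodel', file='mymodel.h')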

13.4. Optimization using target quantization and leaf-deduplication

The leaf-nodes are the output of the trees. For classification this is the class index or class-probabilities, and for regression it is the predicted value.

emlearn implements leaf de-duplication, such that identical leaves are only stored once across all trees. This can considerably reduce the storage needed for the model. This is particularly applicable to the loadable inference strategy.

For majority-based voting in classifiers, the benefits of leaf-deduplication are automatic. This is because the leaf values are class indices, which naturally form a limited set.

For regression, one needs to ensure that the targets are quantized to a small set of values. The best way to do this is application-specific; one simple approach is sketched below.
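
A minimal sketch of uniform target quantization before training (the data and number of levels are illustrative choices):

import numpy
from sklearn.ensemble import RandomForestRegressor

# Placeholder regression data
X = numpy.random.uniform(-1.0, 1.0, size=(200, 4))
y = numpy.random.uniform(0.0, 10.0, size=200)

# Quantize targets to a fixed number of levels, so that many leaves
# share the same value and can be de-duplicated
levels = 32
step = (y.max() - y.min()) / levels
y_quantized = numpy.round((y - y.min()) / step) * step + y.min()

model = RandomForestRegressor(n_estimators=5, max_depth=5, random_state=1)
model.fit(X, y_quantized)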

13.5. Optimization of features

A high-performing and computationally efficient model is dependent on good input features.

Predictive performance of tree-based models is relatively robust against less-useful features. However, such features do tend to get used a little, and may cause higher-than-necessary computational costs. Therefore it is good practice to remove features that are completely useless or redundant. This can be achieved with standard feature selection methods, as sketched below.
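
A minimal sketch using scikit-learn feature selection (the dataset, threshold and model sizes are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=1)

# Fit an initial model and drop features with below-median importance
model = RandomForestClassifier(n_estimators=10, random_state=1)
model.fit(X, y)
selector = SelectFromModel(model, threshold='median', prefit=True)
X_reduced = selector.transform(X)

# Retrain on the reduced feature set
reduced_model = RandomForestClassifier(n_estimators=10, random_state=1)
reduced_model.fit(X_reduced, y)
print('kept features:', selector.get_support().sum())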

Creating new features using feature engineering can have a very large impact, and should always be considered in addition to optimizing the classifier. This tends to be very problem- and task-dependent, and specific recommendations are outside the scope of this documentation. But for inspiration, see for example Energy-efficient activity recognition framework using wearable accelerometers, where tree-based models outperform Convolutional Neural Networks.