RFC prediction are inconsistent when using `max_depth` #52

skjerns · 2019-05-10T09:59:33Z

I have created a RandomForestClassifier in Python using sklearn and converted it to C using sklearn-porter. In around 10-20% of cases, the prediction of the transpiled code is wrong.

I found that the problem occurs when max_depth is specified.

Here's some code to reproduce the issue:

import numpy as np
import sklearn_porter
from sklearn.ensemble import RandomForestClassifier

train_x = np.random.rand(1000, 8)
train_y = np.random.randint(0, 4, 1000)

# with the default max_depth=None (trees are grown fully), the problem does not occur
rfc = RandomForestClassifier(n_estimators=10)
rfc.fit(train_x, train_y)
porter = sklearn_porter.Porter(rfc, language='c')
print(porter.integrity_score(train_x)) # 1.0

# with max_depth=10, the integrity score drops
rfc = RandomForestClassifier(n_estimators=10, max_depth=10)
rfc.fit(train_x, train_y)
porter = sklearn_porter.Porter(rfc, language='c')
print(porter.integrity_score(train_x)) # 0.829

I also saw that Python performs its calculations with double while the C code seems to use float. Might that be an issue? (Unfortunately, changing float -> double did not change anything.)


skjerns · 2019-06-07T14:13:27Z

Looking further into this issue, I believe it might have something to do with the final leaf probabilities. They are slightly different when the tree is not grown to the maximum depth. Therefore the final probability can deviate if the samples are very close to each other.

nok · 2019-06-25T11:20:16Z

Thanks for your work and the hints. I will check the outputs with more tests. Did you check any other languages? Or is it only a C issue?

skjerns · 2019-06-25T12:19:12Z

I did not check other languages yet, but I assume that they have the same problem. I can check tomorrow.

skjerns · 2019-06-26T10:25:36Z

Checked it in Java: Same results. I assume it will be the same in other languages.

nok · 2019-06-26T11:44:42Z

Okay, thank you for double-checking. Then I will dig deeper into the original implementation, in particular into how it behaves under the different max_depth conditions.

skjerns · 2019-06-26T12:59:07Z

I think a good approach would be to implement a predict_proba method. I originally assumed that each tree simply predicts a class and the majority vote wins (as is done in the implementation of sklearn-porter). However, this is not the case in sklearn, and that is likely the reason for the discrepancy.

Some more details I found in this stackoverflow comment thread:
https://stackoverflow.com/questions/30814231/using-the-predict-proba-function-of-randomforestclassifier-in-the-safe-and-rig
(see comments)

About prediction precision: I insist but this is not a question of number of trees. Even with a single decision tree you should be able to get probability predictions with more than one digit. A decision tree aims at clustering the inputs based on some rules (the decision), and these clusters are the leaves of the tree. If you have a leaf with 2 non-spam emails and one spam email from your training data, then the probability prediction for any email that belongs to this leaf/cluster (with regards to the rules established by fitting the model), is : 1/3 for spam and 2/3 for non-spam. – Sebastien Jun 20 '15 at 14:49

About the dependencies in predictions: Again Sklearn definition gives the answer : the probability is computed with regards to the leaf (corresponding to your email to test) 's characteristics : the number of instances of each class in this leaf. This is set when your model is fitted, so it only depends on the training data. In conclusion : the result is the probability of instance 1 to spam with 60% whatever the other 9 instances' probabilities are. – Sebastien Jun 20 '15 at 15:00

similarly here: https://scikit-learn.org/stable/modules/tree.html#tree
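
The leaf rule described in the quoted comments can be sketched in a few lines. This is only an illustration, not code from sklearn or sklearn-porter: the helper name `leaf_probabilities` is hypothetical, and the counts are the spam example from the quote.

```python
from fractions import Fraction


def leaf_probabilities(class_counts):
    """Turn the per-class training-sample counts of one leaf into probabilities.

    Hypothetical helper for illustration; not part of sklearn or sklearn-porter.
    """
    total = sum(class_counts)
    if total == 0:
        raise ValueError("a leaf must contain at least one training sample")
    return [Fraction(count, total) for count in class_counts]


# Leaf from the quoted example: 2 non-spam emails and 1 spam email.
probs = leaf_probabilities([2, 1])
print(probs)  # [Fraction(2, 3), Fraction(1, 3)]
```

A fully grown tree usually ends in pure leaves such as `[0, 5]`, which give probabilities of exactly 0 and 1. A depth-limited tree can end in mixed leaves like the one above.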

So I think if a predict_proba method is implemented correctly (instead of majority winner vote), the problems with max_depth will disappear. And another cool feature would be added, class probabilities :)
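
To make the difference concrete, here is a minimal sketch with made-up per-tree probability vectors (the numbers and the function names `hard_vote` and `soft_vote` are assumptions for illustration). sklearn's RandomForestClassifier predicts the class with the highest mean probability across the trees, which corresponds to `soft_vote` below. A majority vote over per-tree class predictions corresponds to `hard_vote`.

```python
from collections import Counter


def hard_vote(tree_probas):
    """Each tree votes for its own most probable class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in tree_probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(tree_probas):
    """Average the per-tree probability vectors, then pick the most probable class."""
    n_trees = len(tree_probas)
    n_classes = len(tree_probas[0])
    mean = [sum(p[c] for p in tree_probas) / n_trees for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean


# Hypothetical output of three depth-limited trees for one sample (two classes).
tree_probas = [
    [0.55, 0.45],  # narrowly prefers class 0
    [0.55, 0.45],  # narrowly prefers class 0
    [0.10, 0.90],  # strongly prefers class 1
]

print(hard_vote(tree_probas))     # 0
print(soft_vote(tree_probas)[0])  # 1
```

With fully grown trees every vector is typically one-hot, so both rules tend to agree. That would be consistent with the integrity score of 1.0 when max_depth is not set.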

skjerns · 2019-06-28T10:20:56Z

This seems to be the case, indeed:

So depending on implementation: predicted probability is either (a) the mean terminal leaf probability across all trees or (b) the fraction of trees voting either class. If out-of-bag(OOB) prediction, then only in trees where sample is OOB. For a single fully grown tree, I would guess the predicted probability only could be 0 or 1 for any class, because all terminal nodes are pure(same label). If the single tree is not fully grown and/or more trees are grown, then the predicted probability can be a positive rational number from 0 to 1.

https://stats.stackexchange.com/questions/193424/is-decision-tree-output-a-prediction-or-class-probabilities

So we'd need to change the internal structure such that each tree does not return the class index but a probability vector.
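
A rough sketch of what such a per-tree function could look like, written in Python for brevity. The array names mirror the attributes of sklearn's fitted `tree_` object (`children_left`, `children_right`, `feature`, `threshold`, `value`), but the tiny tree itself is hand-made, and `predict_proba_tree` is a hypothetical name rather than sklearn-porter's actual template code.

```python
LEAF = -1  # sklearn marks missing children with -1

# Hand-made tree with 5 nodes, 2 features and 2 classes (illustrative values).
children_left = [1, LEAF, 3, LEAF, LEAF]
children_right = [2, LEAF, 4, LEAF, LEAF]
feature = [0, -2, 1, -2, -2]             # -2 means "no split" (leaf)
threshold = [0.5, -2.0, 0.3, -2.0, -2.0]
value = [[4, 4], [3, 1], [1, 3], [0, 2], [1, 1]]  # class counts per node


def predict_proba_tree(x):
    """Walk from the root to a leaf and return the leaf's class probabilities."""
    node = 0
    while children_left[node] != LEAF:
        # sklearn convention: go left when the feature value is <= threshold.
        if x[feature[node]] <= threshold[node]:
            node = children_left[node]
        else:
            node = children_right[node]
    counts = value[node]
    total = sum(counts)
    return [c / total for c in counts]


print(predict_proba_tree([0.2, 0.9]))  # [0.75, 0.25]
print(predict_proba_tree([0.8, 0.1]))  # [0.0, 1.0]
```

The forest would then average these vectors over all trees and take the argmax, instead of counting one class index per tree.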

nok · 2019-07-11T21:11:49Z

Hello @skjerns, JFYI, I started to implement the predict_proba method for all listed estimators.

For that I began with the DecisionTreeClassifier estimator and the high-level languages. After that I will focus on the RandomForestClassifier estimator with the DecisionTreeClassifier as base estimator.

crea-psfc · 2019-08-06T22:31:39Z

Hi @nok and @skjerns, I have actually looked into this as I wanted to integrate in the porter C library the functionality for analyzing the feature contributions: https://github.com/andosa/treeinterpreter . This technique allows you to extract the importance of the features, when testing unseen samples and uncovers the drivers of the final Random Forests decision. It basically keeps track of the samples population before/after a split by associating gains/losses to the splitting feature. I'm introducing this as it is a pretty short step between implementing this and the predict_proba method for the forest. I am currently working on that.

@nok, let me know how you wanna proceed and I can commit on a dev branch my changes to the C templates and the __init__.py file.
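
The bookkeeping described above can be sketched for a single tree as follows. This is a simplified illustration of the idea (prediction = value at the root + sum of per-feature changes along the decision path), not the treeinterpreter code or the proposed C template. The tiny tree and the names `node_proba` and `explain` are hypothetical.

```python
LEAF = -1
N_FEATURES = 2

# Hand-made tree with 5 nodes, 2 features and 2 classes (illustrative values).
children_left = [1, LEAF, 3, LEAF, LEAF]
children_right = [2, LEAF, 4, LEAF, LEAF]
feature = [0, -2, 1, -2, -2]
threshold = [0.5, -2.0, 0.3, -2.0, -2.0]
value = [[4, 4], [3, 1], [1, 3], [0, 2], [1, 1]]  # class counts per node


def node_proba(node):
    """Class distribution of the training samples that reached this node."""
    total = sum(value[node])
    return [c / total for c in value[node]]


def explain(x):
    """Return (bias, contributions); bias plus all contributions equals the leaf probabilities."""
    node = 0
    bias = node_proba(node)  # class distribution at the root
    n_classes = len(bias)
    contributions = [[0.0] * n_classes for _ in range(N_FEATURES)]
    while children_left[node] != LEAF:
        split_feature = feature[node]
        if x[split_feature] <= threshold[node]:
            child = children_left[node]
        else:
            child = children_right[node]
        before, after = node_proba(node), node_proba(child)
        # Credit the change in class distribution to the feature used for the split.
        for c in range(n_classes):
            contributions[split_feature][c] += after[c] - before[c]
        node = child
    return bias, contributions


bias, contributions = explain([0.8, 0.1])
print(bias)           # [0.5, 0.5]
print(contributions)  # [[-0.25, 0.25], [-0.25, 0.25]]
```

Adding the bias and the contributions gives the leaf's probability vector (`[0.0, 1.0]` here), which is why this is only a short step away from a per-tree predict_proba.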

skjerns changed the title ~~RFC prediction different from sklearn an C code~~ RFC prediction are inconsistent when using max_depth May 15, 2019

skjerns mentioned this issue Jun 7, 2019

RandomForestClassifier only works with int emlearn/emlearn#9

Closed

skjerns mentioned this issue Jun 28, 2019

RFC: Inconsistent results when using max_depth - support soft voting emlearn/emlearn#12

Open

jonnor mentioned this issue Mar 17, 2023

Prediction for ExtraTree model differs from sklearn (tested for C model) #35

Open
