How does Entropy work to split Decision Trees?

How to determine which split is best for a decision tree

Entropy! The split with the lowest entropy wins. Think of entropy as the "impurity" of a group. Less entropy = more pure. More entropy = more chaos = less pure.

Note: -log(p) is the information content of a single outcome with probability p, and on its own, more of it is better,

while Entropy = -sum(p * log(p)) over all possible outcomes.

Turns out, once you weight each -log(p) by its probability and sum over ALL possibilities in a group, you get the expected information (i.e. uncertainty) of the group, and now lower is better. It's just how the numbers work out. (All logs here are base 2, by the way.)

Btw, if a group is homogeneous, i.e. there's only one possible outcome, then entropy = 0: no randomness, by definition.
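Here's a minimal sketch of that calculation in Python (using base-2 logs, which is what all the numbers in this post use):

    import math

    def entropy(counts):
        """Entropy (base 2) of a group, given class counts, e.g. [16, 14]."""
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total)
                   for c in counts if c > 0)  # skip c == 0: 0*log(0) -> 0

    print(entropy([16, 14]))  # ~0.9968, the mixed population from the example below
    print(entropy([8, 0]))    # 0.0 for a homogeneous group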

    Note: you could use the Gini index instead, which is an easier calculation: Gini = 1 - sum(p^2) over all p in the group, then take a weighted average over the groups from a split, same as with Entropy.

    (It's computationally faster, by quite a bit, which is why R uses it as the default, and it gives almost the same answer as Entropy, but empirically slightly worse splits, i.e. lower accuracy on test datasets.)
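And a matching sketch for Gini:

    def gini(counts):
        """Gini impurity of a group: 1 - sum(p^2) over the class proportions."""
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    print(gini([16, 14]))  # ~0.4978 for the mixed population
    print(gini([8, 0]))    # 0.0 for a homogeneous group, same as entropy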

(Also recall that for binary splits, left subtree = yes, right subtree = no, sort of like Huffman encoding.)


Let's work through an example.

Suppose you have 30 people in total and two classes: those who would write off and those who would not (like default/not default). 16 write off, 14 do not.

Entropy of entire population = -(16/30 * log(16/30) + 14/30 * log(14/30)) = 0.99679


Now you have two possible splits,

1. By Residence: Own, Rent, or Other

2. By Balance: <50K or >= 50K


To judge which one is better, find the weighted average of the entropies of the groups resulting from a split. Whichever split has the lowest weighted average gives the most information.


For Residence, say

8 people own their residence, and of those, 1 writes off while 7 do not

10 people rent, and of those, 6 write off and 4 do not

the remaining 12 people have some other residence, and of those, 7 write off and 5 do not.


Then we find the weighted average of them (the expected entropy after the split):

    E(Entropy(Residence)) = 8/30 * Entropy(own) + 10/30 * Entropy(rent) + 12/30 * Entropy(other)

since:

    Entropy(own) = - (1/8 * log(1/8) + 7/8*log(7/8)) = 0.54356

    Entropy(rent) = - (4/10 * log(4/10) + 6/10*log(6/10)) = 0.97095

    Entropy(other) = - (5/12 * log(5/12) + 7/12*log(7/12))  = 0.97987

then the weighted average is

    = 8/30 *  0.54356 + 10/30 * 0.97095 + 12/30 * 0.97987 = 0.8605


The Information Gain (IG)
= Entropy before split - Expected Entropy after split = 0.99679 - 0.8605 = 0.1363
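As a sanity check, here's the same arithmetic using the entropy helper sketched earlier (each group's counts are [writes off, does not]):

    def expected_entropy(groups):
        """Weighted average entropy of the groups a split produces."""
        n = sum(sum(g) for g in groups)
        return sum(sum(g) / n * entropy(g) for g in groups)

    residence = [[1, 7], [6, 4], [7, 5]]  # own, rent, other
    print(expected_entropy(residence))                      # ~0.8605
    print(entropy([16, 14]) - expected_entropy(residence))  # IG ~0.1363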


Then, for Balance:

say, 13 people have <50K, and 12/13 wrote off their loan

and 17 people have >= 50K, and 4/17 wrote off their loan

then: 

    E(Entropy(Balance)) = 13/30 * Entropy(<50K) + 17/30 * Entropy(>=50K)

    Entropy(<50K) = - (1/13 * log(1/13) + 12/13*log(12/13)) = 0.39124 

    Entropy(>=50K) = - (4/17 * log(4/17) + 13/17*log(13/17)) = 0.78713

thus

    E(Entropy(Balance)) = 13/30 * 0.39124 + 17/30 * 0.78713  = 0.6156

    IG = 0.99679 - 0.6156 = 0.3812
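Same check for Balance, with the same helper:

    balance = [[12, 1], [4, 13]]  # <50K, >=50K: [writes off, does not]
    print(expected_entropy(balance))                      # ~0.6156
    print(entropy([16, 14]) - expected_entropy(balance))  # IG ~0.3812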


Lower Entropy after the split = more IG. Since you gain almost 3 times as much information by splitting on Balance, that's the attribute we choose to split on first.

    

In case you still care about Gini Index for calculating impurity

You don't exactly have an initial Gini index to measure against, the way you have an initial entropy, because a drop in Gini doesn't carry the same "information gain" meaning.

But it's still useful for calculating which groups are more "impure".

Using the same example as above, splitting by Residence:

As a reminder, our dataset shows:

8 people own their residence, and of those, 1 writes off while 7 do not

10 people rent, and of those, 6 write off and 4 do not

the remaining 12 people have some other residence, and of those, 7 write off and 5 do not.

so

    Impurity(own) = 1 - (1/8)^2 - (7/8)^2 = 0.21875

    Impurity(rent) = 1 - (4/10)^2 - (6/10)^2 = 0.48

    Impurity(other) = 1 - (5/12)^2 - (7/12)^2  = 0.48611111

then find the expected impurity for the split by Residence:

    E(Impurity(residence)) = 8/30 *  0.21875 + 10/30 * 0.48 + 12/30 * 0.48611111 = 0.41277778
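In code, reusing the gini helper from earlier:

    def expected_gini(groups):
        """Weighted average Gini impurity of the groups a split produces."""
        n = sum(sum(g) for g in groups)
        return sum(sum(g) / n * gini(g) for g in groups)

    print(expected_gini([[1, 7], [6, 4], [7, 5]]))  # residence: ~0.41278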


Same, for Balance:

reminder: 13 people have <50K, and 12/13 wrote off their loan

and 17 people have >= 50K, and 4/17 wrote off their loan

then: 

    Impurity(<50K) = 1 - (1/13)^2 - (12/13)^2 = 0.14201183

    Impurity(>=50K) = 1 - (4/17)^2 - (13/17)^2 = 0.35986159

then find the expected impurity for the split by Balance:

    E(Impurity(balance)) = 13/30 * 0.14201183 + 17/30 * 0.35986159  = 0.26546003
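And the same check in code:

    print(expected_gini([[12, 1], [4, 13]]))  # balance: ~0.26546, lower than residence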


Thus we would again choose to split on Balance, since its groups are purer, or less impure.


Then we continue onward until the tree is built, or we have exhausted all our attributes.

For continuous attributes, we can try different thresholds (each of which could be treated as a different attribute) until we find the best split, as sketched below.
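A rough sketch of that threshold search, reusing the entropy helper from earlier (the function name and structure here are just for illustration, not from any particular library):

    from collections import Counter

    def best_threshold(values, labels):
        """Try the midpoint between each pair of adjacent sorted unique
        values; keep the threshold with the lowest expected entropy."""
        n = len(values)
        uniq = sorted(set(values))
        best_t, best_e = None, float("inf")
        for lo, hi in zip(uniq, uniq[1:]):
            t = (lo + hi) / 2
            left = [lab for v, lab in zip(values, labels) if v < t]
            right = [lab for v, lab in zip(values, labels) if v >= t]
            e = (len(left) / n * entropy(list(Counter(left).values()))
                 + len(right) / n * entropy(list(Counter(right).values())))
            if e < best_e:
                best_t, best_e = t, e
        return best_t, best_e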

Remember, the Gini impurity method does not always generate the same tree as the Entropy method. It is faster, but empirically slightly worse at classification when tested.


