How does Entropy work to split Decision Trees?
How to determine which split is best for a decision tree
Entropy! The split with the lowest entropy wins. Think of entropy as the "impurity" of a group. Less entropy = more pure. More entropy = more chaos = less pure.
Note: -log2(p) = the information content of an outcome with probability p (all logs here are base 2); information gain is what we want to maximize,
while Entropy = - sum(p * log2(p)) is the expected information over all outcomes in a group.
Turns out, once you weight each term by its probability and sum over ALL possible outcomes in a group, the lower the better. It's just how the numbers work out.
Btw, if a group is homogeneous, i.e. there is only one possible outcome, then entropy = 0: no randomness, by definition.
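If you want to poke at this, here's a minimal Python sketch of the entropy calculation (the function name and examples are mine, just for illustration):

```python
from math import log2

def entropy(counts):
    """Entropy (base 2) of a group, given the count of each outcome."""
    total = sum(counts)
    # zero counts are skipped: p*log2(p) -> 0 as p -> 0
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([8, 8]))  # 1.0  -- 50/50 group, maximum chaos
print(entropy([8, 0]))  # -0.0, i.e. zero -- homogeneous group, no randomness
```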
Note: you could use the Gini index instead, an easier calculation: Gini = 1 - sum(p^2) over all classes p in the group, then take the weighted average over the groups from a split, same as with Entropy.
(It's computationally faster, by quite a bit, which is why R uses it as the default, and it gives almost the same answer as Entropy, but empirically slightly worse splits, i.e. lower accuracy on test datasets.)
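And the Gini version of the same sketch, again just for illustration:

```python
def gini(counts):
    """Gini impurity of a group, given the count of each outcome."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([8, 8]))  # 0.5 -- the maximum for two classes
print(gini([8, 0]))  # 0.0 -- homogeneous group
```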
(also recall, for binary splits, left subtree = yes, right = no, sort of like Huffman encoding)
Let's work through an example.
Suppose you have 30 people total and two outcomes: write off or don't write off (like default/not default); 16 write off, 14 do not.
Entropy of entire population = - (16/30 * log(16/30) + 14/30 * log(14/30)) = 0.99679
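A quick sanity check in Python (log base 2):

```python
from math import log2

p_yes, p_no = 16 / 30, 14 / 30
print(-(p_yes * log2(p_yes) + p_no * log2(p_no)))  # ~0.99679
```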
Now you have two possible splits,
1. By Residence: Own, Rent, or Other
2. By Balance: <50K or >= 50K
To judge which one, find the weighted average of the entropies of the groups resulting from each split. Whichever has the lowest weighted average is the split that gives the most information.
For Residence, say
8 people own their residence, and of those 1 writes off while 7 do not
10 people rent, and of those 6 write off and 4 do not
12 people (the remaining) have some other residence, and of those 7 write off and 5 do not.
then we find the weighted average of these (the expected entropy after the split)
E(Entropy(Residence)) = 8/30 * Entropy(own) + 10/30 * Entropy(rent) + 12/30 * Entropy(other)
since:
Entropy(own) = - (1/8 * log(1/8) + 7/8*log(7/8)) = 0.54356
Entropy(rent) = - (6/10 * log(6/10) + 4/10*log(4/10)) = 0.97095
Entropy(other) = - (5/12 * log(5/12) + 7/12*log(7/12)) = 0.97987
then a weighted average would be
= 8/30 * 0.54356 + 10/30 * 0.97095 + 12/30 * 0.97987 = 0.8605
The Information Gain (IG) = Entropy before split - Expected Entropy after split = 0.99679 - 0.8605 = 0.1363
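Here's the whole Residence calculation as a small, self-contained Python sketch (the helper and variable names are mine, just for illustration):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (write-offs, non-write-offs) in each Residence group
own, rent, other = (1, 7), (6, 4), (7, 5)
n = 30

expected = sum(sum(g) / n * entropy(g) for g in (own, rent, other))
print(expected)                       # ~0.8605
print(entropy((16, 14)) - expected)   # information gain, ~0.136
```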
Then, for Balance:
say, 13 people have <50K, and 12/13 wrote off their loan
and 17 people have >= 50K, and 4/17 wrote off their loan
then:
E(Entropy(Balance)) = 13/30 * Entropy(<50K) + 17/30 * Entropy(>=50K)
Entropy(<50K) = - (12/13 * log(12/13) + 1/13*log(1/13)) = 0.39124
Entropy(>=50K) = - (4/17 * log(4/17) + 13/17*log(13/17)) = 0.78713
thus
E(Entropy(Balance)) = 13/30 * 0.39124 + 17/30 * 0.78713 = 0.6156
IG = 0.99679 - 0.6156 = 0.3812
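Same sketch for the Balance split (the helper is repeated so the snippet runs on its own):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# (write-offs, non-write-offs) in each Balance group
low, high = (12, 1), (4, 13)
n = 30

expected = sum(sum(g) / n * entropy(g) for g in (low, high))
print(expected)                       # ~0.6156
print(entropy((16, 14)) - expected)   # information gain, ~0.3812
```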
Lower Entropy after split = More IG. Since you gain almost 3 times as much information when splitting using Balance, that's the one we choose to split on first.
In case you still care about Gini Index for calculating impurity
You don't exactly have an initial Gini index to go off of, like an initial entropy, because Gini doesn't directly correspond to information gain.
But it's still useful for figuring out which groups are more "impure".
Using the same example as above, splitting by Residence:
As a reminder, our dataset shows:
8 people own their residence, and of those 1 writes off while 7 do not
10 people rent, and of those 6 write off and 4 do not
12 people (the remaining) have some other residence, and of those 7 write off and 5 do not.
so
Impurity(own) = 1 - (1/8)^2 - (7/8)^2 = 0.21875
Impurity(rent) = 1 - (6/10)^2 - (4/10)^2 = 0.48
Impurity(other) = 1 - (5/12)^2 - (7/12)^2 = 0.48611111
then find the expected value of impurity for the split by Residence
E(Impurity(residence)) = 8/30 * 0.21875 + 10/30 * 0.48 + 12/30 * 0.48611111 = 0.41277778
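As a Python sketch (same caveat as before, the names are mine):

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# (write-offs, non-write-offs) in each Residence group
own, rent, other = (1, 7), (6, 4), (7, 5)
n = 30
print(sum(sum(g) / n * gini(g) for g in (own, rent, other)))  # ~0.4128
```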
Same, for Balance:
reminder: 13 people have <50K, and 12/13 wrote off their loan
and 17 people have >= 50K, and 4/17 wrote off their loan
then:
Impurity(<50K) = 1 - (1/13)^2 - (12/13)^2 = 0.14201183
Impurity(>=50K) = 1 - (4/17)^2 - (13/17)^2 = 0.35986159
then find the expected value of impurity for the split by Balance
E(Impurity(balance)) = 13/30 * 0.14201183 + 17/30 * 0.35986159 = 0.26546003
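And the Balance version of the same sketch:

```python
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# (write-offs, non-write-offs) in each Balance group
low, high = (12, 1), (4, 13)
n = 30
print(sum(sum(g) / n * gini(g) for g in (low, high)))  # ~0.2655, lower than Residence's ~0.4128
```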
thus we would choose to split by Balance, since the resulting groups are purer, or less impure.
Then we continue onwards until the tree is built, or we have exhausted all our attributes.
For continuous attributes, we can try different thresholds (each of which can be treated as a different attribute) until we find the best split.
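A rough sketch of what that threshold search could look like, using midpoints between adjacent sorted values as candidate thresholds (all the names and the toy data here are made up for illustration):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def group_entropy(labels):
    # entropy of one side of a candidate split
    return entropy([labels.count(c) for c in set(labels)])

def best_threshold(values, labels):
    """Try midpoints between adjacent sorted values; return (threshold, expected entropy) for the best one."""
    pairs = sorted(zip(values, labels))
    xs = sorted(set(values))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in pairs if v < t]
        right = [lab for v, lab in pairs if v >= t]
        score = (len(left) * group_entropy(left)
                 + len(right) * group_entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# toy data: balances (in $K) and whether the person wrote off
balances = [10, 20, 30, 45, 60, 80, 90, 120]
wrote_off = ["yes", "yes", "yes", "yes", "no", "no", "yes", "no"]
print(best_threshold(balances, wrote_off))  # (52.5, ~0.406) on this toy data
```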
Remember, the Gini impurity method does not always generate the same tree as the entropy method. It is faster, but empirically slightly worse at classification when tested.