Decision Tree Algorithm Explained

To perform classification, the Decision Tree Algorithm uses a structure that mimics a tree. The algorithm divides the dataset into smaller subsets, which guide the creation of tree nodes. These nodes can be either Decision Nodes or Leaf Nodes, depending on their function.

• A Decision Node represents a question or a decision.
• A Leaf Node represents a decision that has been made, or the final result.

What is a Decision Tree?

A decision tree is a tree in which each non-leaf (internal) node is associated with a decision, and the leaf nodes are generally associated with an outcome or class label. Each internal node tests one or more attribute values, leading to two or more links or branches. Each link, in turn, is associated with a possible value of the decision. These links are mutually distinct (each link coming out of a node is different from every other link from that node) and collectively exhaustive (together, the links cover all possible choices that can be made at that node). This means that exactly one of the links can be followed, and every possibility is taken care of: there is a link for each possibility.

Decision trees are excellent tools for choosing between several courses of action. They provide a highly effective structure within which you can lay out options and investigate the possible outcomes of choosing them. In a Binary Decision Tree, each node states the decision to be taken or the comparison to be made, and has two outgoing edges: one edge represents the outcome yes/true, and the other represents the outcome no/false.

Example of Decision Tree

Let’s say there are four coins labelled A, B, C, D. Three of them have equal weight, but one is heavier. We need to find out which one it is.

This problem can be modelled as a Decision Tree. Let’s say we start by comparing the combined weight of two of these coins with that of the other two.

At the root node, compare the weight of A + B with that of C + D. If A + B is greater than C + D, then compare the weights of A and B; otherwise compare the weights of C and D.

This comparison of weights can be shown as the following diagram.
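
The comparisons above can be sketched in code. A minimal sketch, assuming the coin weights are given as a plain dictionary; `find_heavier` is an illustrative name, not part of any library:

```python
# Four-coin problem as a decision tree of weight comparisons.
def find_heavier(w):
    """Return the label of the single heavier coin among A, B, C, D."""
    # Root node: compare the combined weight of A+B against C+D.
    if w["A"] + w["B"] > w["C"] + w["D"]:
        # Heavier coin is A or B; one more comparison decides it.
        return "A" if w["A"] > w["B"] else "B"
    else:
        # Heavier coin is C or D.
        return "C" if w["C"] > w["D"] else "D"

print(find_heavier({"A": 1, "B": 1, "C": 2, "D": 1}))  # C
```

Each `if` corresponds to one internal node of the tree, and each `return` to one leaf.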

Working of Decision Tree Algorithm Explained

The Decision Tree Algorithm starts by splitting the dataset according to the parameters defined in the Decision Nodes. Depending on those parameters, the algorithm creates multiple branches out of a Decision Node and points them to new nodes. It then follows those new nodes recursively until it reaches a Leaf Node.

It’s similar to driving a car from home to some destination: first you start from home, then take a decision (Decision Node) to turn left or right (the possible branches coming out of a Decision Node), and keep making such turns until you reach the destination (Leaf Node).
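
This traversal can be sketched as a small recursive function. The dict-based node layout ("test"/"yes"/"no"/"label" keys) is an illustrative choice, not a standard API:

```python
# Follow decision nodes down to a leaf, one test at a time.
def predict(node, sample):
    # A leaf node carries the final answer.
    if "label" in node:
        return node["label"]
    # A decision node applies its test and follows the matching branch.
    branch = "yes" if node["test"](sample) else "no"
    return predict(node[branch], sample)

tree = {
    "test": lambda s: s["weight"] > 10,
    "yes": {"label": "heavy"},
    "no":  {"label": "light"},
}
print(predict(tree, {"weight": 12}))  # heavy
```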

Constructing a Decision Tree in Machine Learning

The training set consists of a collection of patterns that have been labelled (features together with their respective class labels). The tree is built in such a way that it is consistent with the training set. The decision tree associated with a training set expresses a huge number of cases, across all features of the training set, in a straightforward way, which is very useful (knowledge gain). This process of extracting logical patterns from a training dataset is called Inductive Learning.

More specifically, starting from the Root Node, the Decision Tree Algorithm performs a test at each node, creates branches based on the outcome of that test, then moves on to the newly generated nodes one by one and repeats the same testing process until every node has become a Leaf Node.

These tests, which guide the Decision Tree Algorithm in generating new nodes, are primarily of three types: the Axis-Parallel test, the Linear Combination of Features test, and the Non-Linear Combination of Features test.

Let me explain what exactly these three types of tests are.

Axis-Parallel Test

This test is of the form x > a0, where x is a feature in the dataset and a0 is a threshold value. For example, “weight > 10” is an Axis-Parallel Test which checks whether a person’s weight is greater than 10 or not. This test divides the plane of data points into two regions.
Basically, an Axis-Parallel Test involves only one feature of the given dataset.
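
A minimal sketch of an axis-parallel test, using the “weight > 10” example from the text; the feature name, threshold, and sample values are illustrative:

```python
# Axis-parallel test: compare a single feature against a threshold.
def axis_parallel_test(sample, feature, a0):
    return sample[feature] > a0

# The test splits these points into two regions of the plane.
points = [{"weight": 5}, {"weight": 12}, {"weight": 9}, {"weight": 30}]
above = [p for p in points if axis_parallel_test(p, "weight", 10)]
below = [p for p in points if not axis_parallel_test(p, "weight", 10)]
print(len(above), len(below))  # 2 2
```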

Linear Combination of Features Test

This test considers multiple features of a dataset to make a decision about branching out from a node.

$$\sum_{i=1}^{d} a_{i} x_{i}>a_{0}$$
• xi is the ith feature in the dataset
• ai is the weight associated with the ith feature
• a0 is a constant threshold

Let’s take an example: 0.1 × height + 0.7 × weight > 10. This defines a line dividing the plane of data points into two regions, one containing points where the value exceeds 10 and the other where it does not.
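
The height/weight example can be checked in code. A minimal sketch; the weights (0.1, 0.7) and threshold 10 come from the example above, while the sample values are made up:

```python
# Linear-combination test: sum(a_i * x_i) > a0 over several features.
def linear_test(x, a, a0):
    return sum(ai * xi for ai, xi in zip(a, x)) > a0

# x = (height, weight); a = (0.1, 0.7); a0 = 10
print(linear_test((170, 80), (0.1, 0.7), 10))  # True: 17 + 56 = 73 > 10
print(linear_test((50, 5), (0.1, 0.7), 10))    # False: 5 + 3.5 = 8.5
```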

Non-Linear Combination of Features Test

This is quite similar to the “Linear Combination of Features Test”, except that in this case the dividing function is non-linear.

$$f(x)>a_{0}$$
• f(x) is a non-linear function of the features
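
A small sketch of such a test; the quadratic boundary x1² + x2² > 25 is purely illustrative:

```python
# Non-linear test: a non-linear function of the features compared
# against a threshold. Here the dividing curve is a circle of radius 5.
def nonlinear_test(x1, x2):
    return x1 ** 2 + x2 ** 2 > 25

print(nonlinear_test(1, 2))  # False (inside the circle)
print(nonlinear_test(4, 4))  # True  (outside the circle)
```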

Which feature in the DataSet to use for Splitting Nodes?

To build the decision tree, one or more attributes of the training dataset must be selected at each node, and a decision must be made. The most important attribute is used at, or within a few hops of, the Root of the Decision Tree. Moreover, when deciding to split a node, the outcome of the split must be considered: will it lead to more meaningful nodes than the current one, or not?

For example, suppose a node titled Node A is being split, according to some decision rule, into two other nodes B and C. In this scenario, before making the split, it should be checked whether the split leads to more knowledge about the data. If the split divides Node A into two nodes such that the dataset is clearly partitioned into two separate groups, it is quite useful. But if the split leads to nodes that are very similar to the current node, or makes logical patterns in the data invisible, then it is not a useful split and should not be made.

Whether a split of a node based on the class labels of a feature is useful can be measured as Impurity, which can be calculated mathematically.
The lower the impurity after a split, the better the split is.

Measuring Impurity

Entropy or Information Impurity

The entropy impurity at a node N is denoted by i(N) and can be calculated using the following formula.

$$i(N)=-\sum_{j} P\left(w_{j}\right) \log _{2} P\left(w_{j}\right)$$
• P(wj) = Fraction of patterns at node N of category wj.

Variance or Gini Impurity

At a node N, it is denoted by i(N) and can be calculated using the following formula.

$$i(N)=\sum_{i \neq j} P\left(w_{i}\right) P\left(w_{j}\right)=\frac{1}{2}\left[1-\sum_{j} P^{2}\left(w_{j}\right)\right]$$
• P(wj) = Fraction of patterns at node N of category wj.

Misclassification Impurity

At a node N, it is denoted by i(N) and can be calculated using the following formula.

$$i(N)=1-\max _{j} P\left(w_{j}\right)$$
• P(wj) = Fraction of patterns at node N of category wj.

Example explaining the measurement of the three types of Impurity – Decision Tree

Let’s say a node has 100 data points (think of them as 100 people): 30 belong to High School, 20 to Undergraduate, and 50 to Postdoc.

Now assume this node splits into two child nodes containing, respectively: 10 High School, 0 Undergraduate, 30 Postdoc; and 20 High School, 20 Undergraduate, 20 Postdoc.

This split of the node looks like the following diagram:

Calculating Entropy or Information Impurity

The mathematical formula for calculating Entropy or Information Impurity is:

$$i(N)=-\sum_{j} P\left(w_{j}\right) \log _{2} P\left(w_{j}\right)$$
• P(wj) = Fraction of patterns at node N of category wj

Below is the procedure for calculating the Entropy Impurity of Node A. A similar approach can be applied to the other nodes in the diagram.

\begin{gathered}
i(\text{Node A})=-P(\text{High School}) \log _{2} P(\text{High School})-P(\text{Undergraduate}) \log _{2} P(\text{Undergraduate})-P(\text{Postdoc}) \log _{2} P(\text{Postdoc}) \\
P(\text{High School})=\frac{\text{Number of High School labelled data points in Node A}}{\text{Total number of data points in Node A}}=\frac{30}{30+20+50}=\frac{30}{100}=0.3 \\
P(\text{Undergraduate})=\frac{\text{Number of Undergraduate labelled data points in Node A}}{\text{Total number of data points in Node A}}=\frac{20}{30+20+50}=\frac{20}{100}=0.2 \\
P(\text{Postdoc})=\frac{\text{Number of Postdoc labelled data points in Node A}}{\text{Total number of data points in Node A}}=\frac{50}{30+20+50}=\frac{50}{100}=0.5 \\
\log _{2} P(\text{High School})=\log _{2}(0.3)=-1.73 \\
\log _{2} P(\text{Undergraduate})=\log _{2}(0.2)=-2.32 \\
\log _{2} P(\text{Postdoc})=\log _{2}(0.5)=-1.0 \\
i(\text{Node A})=-(0.3)(-1.73)-(0.2)(-2.32)-(0.5)(-1.0)=0.519+0.464+0.5=1.483
\end{gathered}

Similarly, the Entropy Impurity can be calculated for the other nodes, B and C.
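
The hand calculation above can be verified in code. Note that the derivation uses logs rounded to two decimals, which gives 1.483; at full precision the entropy of Node A is ≈ 1.485. A minimal sketch; `entropy_impurity` is an illustrative name:

```python
import math

# Entropy impurity i(N) = -sum_j P(w_j) * log2 P(w_j), computed from
# the class counts of the example nodes A, B, C.
def entropy_impurity(counts):
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # 0 * log 0 taken as 0
    return -sum(p * math.log2(p) for p in probs)

print(round(entropy_impurity([30, 20, 50]), 3))  # Node A: 1.485
print(round(entropy_impurity([10, 0, 30]), 3))   # Node B: 0.811
print(round(entropy_impurity([20, 20, 20]), 3))  # Node C: 1.585
```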

Calculating Variance or Gini Impurity

The mathematical formula for calculating Variance or Gini Impurity is:

$$i(N)=\sum_{i \neq j} P\left(w_{i}\right) P\left(w_{j}\right)=\frac{1}{2}\left[1-\sum_{j} P^{2}\left(w_{j}\right)\right]$$

• P(wj) = Fraction of patterns at node N of category wj

Below is the procedure for calculating the Variance/Gini Impurity of Node B. A similar approach can be applied to the other nodes in the diagram.

\begin{aligned}
&i(\text{Node B})=P(\text{High School}) P(\text{Undergraduate})+P(\text{High School}) P(\text{Postdoc})+P(\text{Undergraduate}) P(\text{Postdoc}) \\
&P(\text{High School})=\frac{\text{Number of High School labelled data points in Node B}}{\text{Total number of data points in Node B}}=\frac{10}{10+0+30}=\frac{10}{40}=0.25 \\
&P(\text{Undergraduate})=\frac{\text{Number of Undergraduate labelled data points in Node B}}{\text{Total number of data points in Node B}}=\frac{0}{10+0+30}=0 \\
&P(\text{Postdoc})=\frac{\text{Number of Postdoc labelled data points in Node B}}{\text{Total number of data points in Node B}}=\frac{30}{10+0+30}=\frac{30}{40}=0.75 \\
&i(\text{Node B})=0.25 \times 0+0.25 \times 0.75+0 \times 0.75=0.25 \times 0.75=0.1875
\end{aligned}

Similarly, the Variance/Gini Impurity can be calculated for the other nodes, A and C.
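
The same computation in code, using the ½[1 − Σ P²] form of the formula above; `gini_impurity` is an illustrative name:

```python
# Gini impurity i(N) = (1/2) * (1 - sum_j P(w_j)^2), computed from
# the class counts of the example nodes.
def gini_impurity(counts):
    total = sum(counts)
    return 0.5 * (1 - sum((c / total) ** 2 for c in counts))

print(round(gini_impurity([10, 0, 30]), 4))   # Node B: 0.1875
print(round(gini_impurity([30, 20, 50]), 4))  # Node A: 0.31
print(round(gini_impurity([20, 20, 20]), 4))  # Node C: 0.3333
```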

Calculating Misclassification Impurity

The mathematical formula for calculating Misclassification Impurity is:

$$i(N)=1-\max _{j} P\left(w_{j}\right)$$
• P(wj) = Fraction of patterns at node N of category wj.

Below is the procedure for calculating the Misclassification Impurity of Node C. A similar approach can be applied to the other nodes in the diagram.

\begin{gathered}
i(\text{Node C})=1-\max (P(\text{High School}), P(\text{Undergraduate}), P(\text{Postdoc})) \\
P(\text{High School})=\frac{\text{Number of High School labelled data points in Node C}}{\text{Total number of data points in Node C}}=\frac{20}{20+20+20}=\frac{20}{60}=0.333 \\
P(\text{Undergraduate})=\frac{\text{Number of Undergraduate labelled data points in Node C}}{\text{Total number of data points in Node C}}=\frac{20}{20+20+20}=\frac{20}{60}=0.333 \\
P(\text{Postdoc})=\frac{\text{Number of Postdoc labelled data points in Node C}}{\text{Total number of data points in Node C}}=\frac{20}{20+20+20}=\frac{20}{60}=0.333 \\
i(\text{Node C})=1-\max (0.333,0.333,0.333)=1-0.333=0.667
\end{gathered}

Similarly, the Misclassification Impurity can be calculated for the other nodes, A and B.

Example Explaining Which Feature to Use for Splitting, Based upon Impurity

Impurity, of any of the three types, can be used to identify which feature gives the better split. If splitting by one feature (say feature A) leads to a greater loss in impurity than splitting by another (say feature B), then it is better to split by feature A.

This loss in impurity is measured by subtracting the overall impurity of the nodes after the split (each child weighted by the fraction of data points reaching it) from the impurity of the node that was split.

Mathematically it can be represented as follows.

$$\Delta i(N)=(\text{Impurity before split})-(\text{Overall impurity after split})$$
• Δi(N) is also called the gain in information at node N
• The feature whose split maximises the value of Δi(N) is selected
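
Putting the pieces together, the gain of the example split A → (B, C) can be computed with entropy as the impurity measure, weighting each child by the fraction of data points it receives. A minimal sketch with illustrative function names:

```python
import math

# Entropy impurity of a node, from its class counts.
def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Information gain: impurity before split minus the weighted
# impurity of the children after the split.
def information_gain(parent, children):
    n = sum(parent)
    after = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - after

# Node A (30/20/50) split into Node B (10/0/30) and Node C (20/20/20).
gain = information_gain([30, 20, 50], [[10, 0, 30], [20, 20, 20]])
print(round(gain, 3))  # 0.21
```

Among several candidate splits, the one with the largest gain would be chosen.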

The next question is which impurity measure to use, as there are three: Entropy, Variance/Gini Impurity, and Misclassification Impurity. This is quite a tricky question to answer, and it varies from dataset to dataset, or even from node to node within a single Decision Tree.

For example, which feature to use for splitting Node A into two child nodes can be decided by calculating (Entropy Impurity before split − Overall Entropy Impurity after split).

But for splitting some other node in the same Decision Tree, (Variance/Gini Impurity before split − Overall Variance/Gini Impurity after split) can be used.

I tried to go through many research papers on Machine Learning/Artificial Intelligence, but still did not find any concrete answer to the question “Which impurity measure should be used for splitting?”, or which factors can help in deciding this. So if you know anything about this, please let me know in the comments below. That would be really helpful for me, as well as my audience. Thanks 🙏🏻

When to Stop Splitting Nodes in Decision Tree Algorithm

So far, I’ve discussed splitting up nodes in a Decision Tree to generate child nodes. The next questions to answer are “When should splitting actually stop?” and “How do we know splitting has reached a point where further splitting is not useful at all?”.

Below are three methodologies which can be used to decide that splitting has extracted all the useful information and that further splitting would be useless.

Use Cross-Validation

A part of the training data is kept aside for validation. This can also be done repeatedly, keeping different subsets aside for validation and averaging the performance. The decision tree which gives the best results during validation is the tree to be considered.

Reduction in Impurity

Splitting can be stopped when the reduction in impurity is very small. That is, stop splitting when

$$\max _{s} \Delta i(s) \leq \beta$$
• where β is a small threshold value and Δi(s) is the reduction in impurity achieved by candidate split s
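
A sketch of this stopping rule, assuming the impurity reductions of the candidate splits have already been computed; all names and values here are illustrative:

```python
# Stop splitting a node when even the best candidate split reduces
# impurity by no more than beta.
def should_stop(candidate_gains, beta=0.01):
    return max(candidate_gains) <= beta

print(should_stop([0.002, 0.008, 0.005], beta=0.01))  # True: keep as leaf
print(should_stop([0.002, 0.210, 0.005], beta=0.01))  # False: split
```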

Depending upon Global Criterion Function

If the criterion function is of the form

$$\alpha \times \text { size }+\sum_{\text {leafnodes }} i(N)$$

where α is a positive constant and size is the number of nodes or links, the strategy is to stop splitting when this global criterion reaches its minimum value.
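
A sketch of the global criterion, with illustrative α and impurity values; a smaller value indicates a better trade-off between tree size and leaf purity:

```python
# Global criterion: alpha * size + sum of leaf-node impurities.
# Growing the tree adds to the size term but can lower leaf impurity.
def global_criterion(num_nodes, leaf_impurities, alpha=0.05):
    return alpha * num_nodes + sum(leaf_impurities)

# One impure node vs. a 3-node tree with two fairly pure leaves.
print(round(global_criterion(1, [1.485]), 3))     # 1.535
print(round(global_criterion(3, [0.1, 0.2]), 3))  # 0.45: splitting helped
```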

Characteristics of Decision Tree Algorithm

1. Links in a Decision Tree are mutually distinct (each link coming out of a node is different from every other link from that node) and collectively exhaustive (together, the links cover all possible choices that can be made at that node).
2. Assuming that continuous features are handled as ranges, the Decision Tree Algorithm is capable of handling both quantitative and qualitative features. For example, a quantitative feature like Income is better handled as a range than as discrete values: rather than a Decision Node testing Income = $10,000, it is better to have a Decision Node testing Income < $10,000. Handling qualitative features is straightforward: a Decision Node can test the Education feature, with branches such as College and High School.
3. Leaf Nodes in the Decision Tree Algorithm can handle categorical or continuous class labels. For categorical class labels, classification is performed; for continuous class labels, regression is performed.
4. Class labels of the dataset are associated with the Leaf Nodes of a Decision Tree.
5. A path from the Root Node to a Leaf Node represents a rule.
6. At each internal node of a Decision Tree, a decision is made.