To perform classification, the **Decision Tree Algorithm** uses a tree-shaped structure. The algorithm divides the dataset into smaller subsets, which guide the creation of **Tree Nodes**. These tree nodes are either Decision Nodes or Leaf Nodes, depending on their function.

- A Decision Node represents a question or a decision.
- A Leaf Node represents a decision that has been made, or the final result.


# What is a Decision Tree?

A decision tree is a tree in which each non-leaf (internal) node is associated with a decision, and the leaf nodes are associated with an outcome or class label. Each internal node tests one or more attribute values, leading to two or more links or branches. Each link, in turn, is associated with a possible value of the decision. **These links are mutually distinct (each link coming out of a node is different from every other link) and collectively exhaustive (together, the links out of a node cover every possible choice that can be made at that node)**. This means that from any node it is possible to follow exactly one of the links, and every possibility is taken care of: there is a link for each possibility.

Decision trees are **excellent tools for choosing between several courses of action**. They provide a highly effective structure within which you can lay out options and investigate the possible outcomes of choosing those options. In a **Binary Decision Tree**, each node states the decision to be taken or the comparison to be made, and has exactly two outgoing edges: one representing the outcome yes/true, the other representing no/false.

## Example of Decision Tree

Let’s say there are four coins, labelled A, B, C, and D, of which three have equal weight and one is heavier. We need to find out which one is heavier.

This problem can be modelled as a decision tree. Let’s say we start by comparing the combined weight of two of these coins with the combined weight of the other two.

At the root node, compare the weight of A + B with that of C + D. If A + B is greater than C + D, then compare the weights of A and B; otherwise, compare the weights of C and D.

This comparison of weights can be shown as the following diagram.
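The coin-weighing tree described above can also be sketched directly in code. A minimal sketch, where the function name and the weight encoding are my own, chosen for illustration:

```python
def find_heavier_coin(weights):
    # weights maps each coin name to its weight; exactly one coin is heavier.
    # Root decision node: compare A + B against C + D.
    if weights["A"] + weights["B"] > weights["C"] + weights["D"]:
        # The heavier coin is A or B: compare them directly (decision node)
        return "A" if weights["A"] > weights["B"] else "B"
    # Otherwise the heavier coin is C or D (the other decision node)
    return "C" if weights["C"] > weights["D"] else "D"

print(find_heavier_coin({"A": 1, "B": 1, "C": 2, "D": 1}))  # prints C
```

Each `if` corresponds to a decision node of the tree, and each `return` to a leaf node.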

# How the Decision Tree Algorithm Works

The **Decision Tree Algorithm** starts by splitting the dataset according to the parameters defined in the Decision Nodes. Depending on those parameters, the algorithm creates multiple branches coming out of a Decision Node and points these branches to new nodes. It then recursively processes those new nodes until it reaches a Leaf Node.

It is similar to driving a car from home to some destination: first you start from home, then you make a decision (Decision Node) to turn left or right (the possible branches coming out of a Decision Node), and you keep making such turns until you reach the destination (Leaf Node).
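The driving analogy can be sketched as a descent through the tree. In this sketch the `Node` class and its fields are hypothetical names chosen for illustration:

```python
class Node:
    def __init__(self, test=None, children=None, label=None):
        self.test = test          # maps a sample to a branch key (decision node)
        self.children = children  # dict: branch key -> child Node
        self.label = label        # set only on leaf nodes

def predict(node, sample):
    # Follow decision nodes until a leaf node is reached
    while node.label is None:
        node = node.children[node.test(sample)]
    return node.label

# Tiny tree: turn "left" if x < 5, otherwise "right"
tree = Node(test=lambda s: "left" if s["x"] < 5 else "right",
            children={"left": Node(label="destination L"),
                      "right": Node(label="destination R")})
print(predict(tree, {"x": 3}))  # prints destination L
```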

# Constructing a Decision Tree in Machine Learning

The training set consists of a collection of labelled patterns (features together with their respective class labels). The tree is built so that it is consistent with the training set. **The decision tree associated with a training set expresses a huge number of cases for each feature in a straightforward way, which is very useful** (the knowledge-gain process). This process of extracting logical patterns from a training dataset is called Inductive Learning.

More specifically, starting from the Root Node, the Decision Tree Algorithm **performs a test at each node** and, **based on the outcome of that test**, creates branches out of the node. It then moves to the newly generated nodes one by one and repeats the same testing process until all nodes have become Leaf Nodes.

These tests, which guide the Decision Tree Algorithm in generating new nodes, are primarily of three types: the Axis-Parallel test, the Linear Combination of Features test, and the Non-Linear Combination of Features test.

Let me explain what exactly these three types of tests are.

**Axis-Parallel Test**

This test is of the form x > a0, where x is a feature in the dataset and a0 is a threshold value. For example, “weight > 10” is an axis-parallel test checking whether the weight of a person is greater than 10. Such a test divides the plane of data points into two regions.

Basically, an axis-parallel test involves only one feature of the given dataset.

**Linear Combination of Features Test**

This test considers multiple features of a dataset to make a decision about branching out from a node.

This test is of the form a_1·x_1 + a_2·x_2 + … + a_d·x_d > a_0, where:

- x_i is the i-th feature in the dataset
- a_i is the weight associated with the i-th feature
- a_0 is a constant threshold value

Let’s take an example: **0.1 × height + 0.7 × weight > 10** defines a line dividing the plane of data points into two regions, one where the expression is greater than 10 and the other where it is less than 10.

**Non-Linear Combination of Features Test**

This is quite similar to the Linear Combination of Features test, except that here the dividing function is non-linear.

This test is of the form f(x) > a_0, where:

- f(x) is a non-linear function of the features
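The three kinds of tests can be sketched as small Python predicates. The feature names (`height`, `weight`) and coefficients come from the examples above, while the non-linear function is purely an illustrative choice:

```python
def axis_parallel_test(x):
    # Single feature against a threshold: weight > 10
    return x["weight"] > 10

def linear_combination_test(x):
    # Weighted sum of several features against a threshold:
    # 0.1 * height + 0.7 * weight > 10
    return 0.1 * x["height"] + 0.7 * x["weight"] > 10

def nonlinear_combination_test(x):
    # f(x) > a0 for a non-linear f; the squared terms are illustrative
    return x["height"] ** 2 + 0.5 * x["weight"] ** 2 > 100

sample = {"height": 170, "weight": 65}
print(axis_parallel_test(sample), linear_combination_test(sample))  # True True
```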

## Which Feature in the Dataset to Use for Splitting Nodes?

To build the decision tree, one or more attributes from the training dataset must be selected at each node, and a decision must be made on them. The most important attribute is used at the very beginning, or a few hops away from the root of the decision tree. Moreover, while making a **decision to split a node, the outcome of the split needs to be considered**: will it lead to the generation of more meaningful nodes than the current node, or not?

For example, say there is a node titled Node A, and as per some decision rule it is being split into two other nodes, B and C. In this scenario, before making the split, it needs to be checked **whether the split would lead to more knowledge gain about the data or not**. If the split divides Node A into two nodes such that the dataset is clearly partitioned into two separate entities, then it is quite useful. But if the split leads to nodes that are quite similar to the current node, or makes logical patterns in the data invisible, then it is not a useful split and should not be made.

Whether a split of a node based on the **class labels of a feature** in the dataset is useful or not can be measured as impurity, which can be calculated mathematically.

**The lower the impurity after a split, the better the split is.**

### Measuring Impurity

**Entropy or Information Impurity**

The entropy impurity at a node N is denoted **i(N)** and can be calculated using the following formula:

i(N) = − Σ_j P(w_j) log₂ P(w_j)

- P(w_j) = fraction of patterns at node N belonging to category w_j.

**Variance or Gini Impurity**

At a node N, it is denoted **i(N)** and can be calculated using the following formula:

i(N) = Σ_{i≠j} P(w_i) P(w_j) = 1 − Σ_j P(w_j)²

- P(w_j) = fraction of patterns at node N belonging to category w_j.

**Misclassification Impurity**

At a node N, it is denoted i(N) and can be calculated using the following formula:

i(N) = 1 − max_j P(w_j)

- P(w_j) = fraction of patterns at node N belonging to category w_j.

#### Example explaining measurement of three types of Impurity – Decision Tree

Let’s say that a node has 100 data points (think of these as 100 people), of which 30 belong to High School, 20 to Undergraduate, and 50 to Postdoc.

Now assume this node is split into two child nodes containing, respectively: 10 High School, 0 Undergraduate, 30 Postdoc; and 20 High School, 20 Undergraduate, 20 Postdoc.

This split is shown in the following diagram:

**Calculating Entropy or Information Impurity**

The mathematical formula for calculating Entropy or Information Impurity is:

i(N) = − Σ_j P(w_j) log₂ P(w_j)

- P(w_j) = fraction of patterns at node N belonging to category w_j.

Below is the calculation of the entropy impurity of Node A (30 High School, 20 Undergraduate, 50 Postdoc out of 100):

i(A) = −(0.3 log₂ 0.3 + 0.2 log₂ 0.2 + 0.5 log₂ 0.5) ≈ 1.485

In the same way, the entropy impurity can be calculated for the other nodes, B and C.

**Calculating Variance or Gini Impurity**

The mathematical formula for calculating Variance or Gini Impurity is:

i(N) = 1 − Σ_j P(w_j)²

- P(w_j) = fraction of patterns at node N belonging to category w_j.

Below is the calculation of the Variance/Gini impurity of Node B (10 High School, 0 Undergraduate, 30 Postdoc out of 40):

i(B) = 1 − (0.25² + 0² + 0.75²) = 1 − 0.625 = 0.375

In the same way, the Variance/Gini impurity can be calculated for the other nodes, A and C.

**Calculating Misclassification Impurity**

The mathematical formula for calculating Misclassification Impurity is:

i(N) = 1 − max_j P(w_j)

- P(w_j) = fraction of patterns at node N belonging to category w_j.

Below is the calculation of the misclassification impurity of Node C (20 High School, 20 Undergraduate, 20 Postdoc out of 60):

i(C) = 1 − max(1/3, 1/3, 1/3) = 1 − 1/3 ≈ 0.667

In the same way, the misclassification impurity can be calculated for the other nodes, A and B.
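All three impurity measures for nodes A, B, and C can be checked with a short Python sketch; the function names are mine, not from any particular library:

```python
from math import log2

def probabilities(counts):
    total = sum(counts)
    return [c / total for c in counts]

def entropy_impurity(counts):
    # i(N) = -sum_j P(w_j) * log2 P(w_j); empty categories contribute 0
    return -sum(p * log2(p) for p in probabilities(counts) if p > 0)

def gini_impurity(counts):
    # i(N) = 1 - sum_j P(w_j)^2
    return 1 - sum(p * p for p in probabilities(counts))

def misclassification_impurity(counts):
    # i(N) = 1 - max_j P(w_j)
    return 1 - max(probabilities(counts))

# Counts per node: [High School, Undergraduate, Postdoc]
nodes = {"A": [30, 20, 50], "B": [10, 0, 30], "C": [20, 20, 20]}
for name, counts in nodes.items():
    print(name,
          round(entropy_impurity(counts), 4),
          round(gini_impurity(counts), 4),
          round(misclassification_impurity(counts), 4))
```

For Node A this gives an entropy of about 1.4855 bits, a Gini impurity of 0.62, and a misclassification impurity of 0.5.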

#### Images showing calculation of Impurity Metrics – Decision Tree Algorithm

## Example Explaining Which Feature to Use for Splitting, Based on Impurity

Impurity, of any of the three types, can be used to identify which feature gives a better split. If splitting by one feature (say feature A) leads to a greater drop in impurity than splitting by another (say feature B), then it is better to split by feature A.

**This drop in impurity can be measured by subtracting the overall impurity of the nodes after the split from the impurity of the node being split.**

Mathematically, it can be represented as follows:

Δi(N) = i(N) − Σ_k (n_k / n) · i(N_k)

where the N_k are the child nodes produced by the split, n_k is the number of patterns at child N_k, and n is the number of patterns at N.

- Δi(N) is also called the **gain in information at node N**.
- The feature whose split maximises the value of Δi(N) is selected.
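This drop can be computed for the example split of Node A into Nodes B and C. A minimal sketch, assuming entropy impurity and hypothetical function names:

```python
from math import log2

def entropy_impurity(counts):
    # i(N) = -sum_j P(w_j) * log2 P(w_j); empty categories contribute 0
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def impurity_drop(parent_counts, children_counts):
    # Delta-i(N) = i(N) - sum_k (n_k / n) * i(N_k)
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy_impurity(child)
                   for child in children_counts)
    return entropy_impurity(parent_counts) - weighted

# Node A (30, 20, 50) splits into B (10, 0, 30) and C (20, 20, 20)
gain = impurity_drop([30, 20, 50], [[10, 0, 30], [20, 20, 20]])
print(round(gain, 4))  # prints 0.21
```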

The next question is which impurity measure to use, since there are three: entropy, Variance/Gini impurity, and misclassification impurity. This is quite a tricky question to answer, and the choice varies from dataset to dataset, or even from node to node within a single decision tree.

For example, for splitting up, say, Node A into two child nodes, the feature to use can be decided by calculating (entropy impurity before split − overall entropy impurity after split).

But for splitting some other node in the same decision tree, (Variance/Gini impurity before split − overall Variance/Gini impurity after split) can be used.

I did try to go through many research papers on Machine Learning/Artificial Intelligence, but I still did not find any concrete answer to the question “Which impurity measure should be used for splitting?”, or what factors can help in deciding this. So if you know anything about this, please let me know in the comments below. That would be really helpful for me, as well as my audience. Thanks 🙏🏻

## When to Stop Splitting Nodes in Decision Tree Algorithm

So far, I’ve discussed splitting nodes in a decision tree to generate child nodes. The next questions to answer are “When should splitting actually stop?” and “How do we know splitting has reached a point where further splitting is not useful at all?”.

Below are three methodologies that can be used to tell that splitting has unlocked as much **information** as it can, and further splitting would be useless.

**Use Cross-Validation**

A part of the training data is kept aside for validation. This can also be done repeatedly, keeping different subsets for validation and averaging the performance. The decision tree that gives the best results during validation is the one to use.

**Reduction in Impurity**

Splitting can be stopped when the reduction in impurity is very small. This means: stop splitting when

Δi(N) ≤ β

- where β is a small threshold value
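As a sketch of this stopping rule (with a hypothetical `should_stop` helper and entropy impurity as the measure):

```python
from math import log2

def entropy_impurity(counts):
    # i(N) = -sum_j P(w_j) * log2 P(w_j); empty categories contribute 0
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def should_stop(parent_counts, children_counts, beta=0.01):
    # Stop splitting when the drop in impurity Delta-i(N) is at most beta
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy_impurity(child)
                   for child in children_counts)
    drop = entropy_impurity(parent_counts) - weighted
    return drop <= beta

# The example split of Node A drops impurity by about 0.21, so keep splitting
print(should_stop([30, 20, 50], [[10, 0, 30], [20, 20, 20]]))  # prints False
```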

**Depending upon Global Criterion Function**

If the criterion function is of the form

α · size + Σ_{leaf nodes} i(N)

where **α** is a positive constant and **size = number of nodes or links**, the strategy is to stop splitting when this global criterion reaches **some minimum value**.

# Characteristics of Decision Tree Algorithm

- Links in a Decision Tree are **mutually distinct** (each link coming out of a node is different from every other link) and **collectively exhaustive** (together, the links out of a node cover every possible choice that can be made at that node).
- Assuming that **continuous features are handled as ranges**, the Decision Tree Algorithm is capable of handling both quantitative and qualitative features. For example, a quantitative feature like Income can be handled as a range rather than as discrete values: rather than a Decision Node testing **Income = $10,000**, it is better to have one testing a range such as **Income < $10,000**. Handling qualitative features is easy: a Decision Node can test the **Education feature**, with branches such as College and High School.
- **Leaf Nodes** in the Decision Tree Algorithm can handle categorical or continuous class labels. For categorical class labels, classification is performed; for continuous class labels, regression is performed.
- Class labels of features in the dataset are associated with Leaf Nodes in a Decision Tree.
- The path from the Root Node to a Leaf Node represents a rule.
- At each internal node of a Decision Tree, a decision is made.