We are very familiar with the word Machine Learning nowadays. Machine Learning technologies are used everywhere. Many IT companies use this type of technologies to improve their product. Decision tree algorithm is one of the important algorithms used in Machine Learning models. It can be used in both classification and regression techniques. In this article, we are going to explore more about decision tree algorithm.
What is Decision tree algorithm?
The basic structure of a tree is made by nodes and branches. In this decision tree there will be root node, leave nodes and branches. It is basically a graphical representation of all the possible solutions available for a given problem depending upon conditions. This is mainly supervised learning. It can be used in both classification and regression. But it is mostly used in classification problems. In the tree, the internal nodes represent the features of data set and the branches of tree represents the decision rules and the leave nodes represent the outputs. We can use Classification and Regression Tree Algorithm in order to build the decision tree.
Let’s understand the concept of decision tree by a simple and easy example. Suppose one candidate has got an offer letter and previously he also got another one offer letter for job. Now he wants to decide his company to job. Here we can use the decision tree. So, first we start with root node. In this root node we check whether the salary is thirty thousand and above or not. If the branch says yes then we go to next node and if the branch goes for no then we can say that this job is declined. Next node can check the location. In this way you can check the features of data set and classify the input data.
Terminologies used in Decision tree algorithm
- Branch/Sub Tree: There exists many sub trees in a decision tree. This sub trees form the main decision tree.
- Leaf Node: From the word leaf, we can understand that this is the ending of our decision tree. It mainly represents the final output of the decision tree.
- Parent/Child Node: We have previously discussed about sub tree. In this sub tree a node come from another node by branch. This next node is called child and previous node is named parent node.
- Pruning: It is a process by which we can minimize the branches of a tree.
- Root Node: It is the node of the tree from where the tree begins. It represents the whole data set. It is divided into subtrees.
- Splitting: It is the process by which we can divide the node into sub nodes.
Working process of Decision tree algorithm
Now, we are going to understand the working process of decision tree algorithm. In this decision tree algorithm, the input data set is taken. After that the attributes of input data is compared with the existing data set and then the branch is chosen. After that we go to next node. In next node, the same process is followed again. These things repeat until we get leaf node means our output. You can understand the above process by this algorithm mentioned below:
- Step-1: We will begin the tree by initializing the root node. This root node will represent the whole data set.
- Step-2: After that we will find the best attribute from the data set. We will use attribute selection measure for this thing.
- Step-3: In next step we will divide the whole data set into subsets which can possess best possible attributes.
- Step-4: After that we will create the node which have best attribute in the decision tree.
- Step-5: We will do this process recursively by dividing the data set into subsets and creating new decision tree nodes. This process will continue till we get the best possible result. When we cannot go further in this recursion we will declare that node as leaf node. This leaf node will be our output result.
Python Implementation of Decision Tree Algorithm
Now, we are going to see how the decision tree algorithm is implemented in Python language. We will use user_data.csv file for this implementation. Follow the steps given below to implement the decision tree algorithm.
- Data Pre-processing step.
- Fitting a Decision-Tree algorithm to the Training set.
- Predicting the test result.
- Test accuracy of the result (Creation of Confusion matrix).
- Visualizing the training set result.
- Visualizing the test set result.
1. Data pre-processing step
Below is the code for its implementation
1. import numpy as nm 2. import matplotlib.pyplot as mtp 3. import pandas as pd 4. data_set= pd.read_csv('user_data.csv') 5. 6. x= data_set.iloc[:, [2,3]].values 7. y= data_set.iloc[:, 4].values 8. 9. from sklearn.model_selection import train_test_split 10. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0) 11. 12. from sklearn.preprocessing import StandardScaler 13. st_x= StandardScaler() 14. x_train= st_x.fit_transform(x_train) 15. x_test= st_x.transform(x_test)
2. Fitting a decision tree algorithm to the training set
Now, we will fit the decision tree algorithm in the training set. For this reason, we have to import DecisionTreeClassifier class from sklearn.tree library. Below is the code for its implementation:
1. From sklearn.tree import DecisionTreeClassifier 2. classifier= DecisionTreeClassifier(criterion='entropy', random_state=0) 3. classifier.fit(x_train, y_train)
3. Predicting the test result
Now, we will predict the test result. Below is the code for its implementation:
1. y_pred= classifier.predict(x_test)
4. Checking the test accuracy of the result
Now, we will check whether the test result is correct or not. Below is the code for its implementation:
1. from sklearn.metrics import confusion_matrix 2. cm= confusion_matrix(y_test, y_pred)
5. Visualization of training set result
Now, we will visualise the result of training set. For this reason, we will use one graph. Below is the code for its implementation:
1. from matplotlib.colors import ListedColormap 2. x_set, y_set = x_train, y_train 3. x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step =0.01), 4. nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01)) 5. mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape), 6. alpha = 0.75, cmap = ListedColormap(('purple','green' ))) 7. mtp.xlim(x1.min(), x1.max()) 8. mtp.ylim(x2.min(), x2.max()) 9. fori, j in enumerate(nm.unique(y_set)): 10. mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], 11. c = ListedColormap(('purple', 'green'))(i), label = j) 12. mtp.title('Decision Tree Algorithm (Training set)') 13. mtp.xlabel('Age') 14. mtp.ylabel('Estimated Salary') 15. mtp.legend() 16. mtp.show()
6. Visualization of test set result
For visualization of test set result, we just need to implement the same code which we used in the implementation of training set result. Here, the only change is that the training set will be implemented by test set. Below is the code for its implementation:
1. from matplotlib.colors import ListedColormap 2. x_set, y_set = x_test, y_test 3. x1, x2 = nm.meshgrid (nm.arange (start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01 ), 4. nm.arange ( start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01 ) ) 5. mtp.contourf ( x1, x2, classifier.predict ( nm.array ( [ x1.ravel(), x2.ravel() ] ).T ).reshape (x1.shape) , 6. alpha = 0.75, cmap = ListedColormap ( ( 'purple', 'green' ) ) ) 7. mtp.xlim ( x1.min(), x1.max() ) 8. mtp.ylim( x2.min(), x2.max() ) 9. fori, j in enumerate ( nm.unique( y_set ) ): 10. mtp.scatter (x_set [y_set == j, 0], x_set [y_set == j, 1], 11. c = ListedColormap(('purple', 'green'))(i), label = j) 12. mtp.title('Decision Tree Algorithm(Test set)') 13. mtp.xlabel('Age') 14. mtp.ylabel('Estimated Salary') 15. mtp.legend() 16. mtp.show()
Advantages of the Decision Tree
- This decision tree algorithm can help you solving the problems related to decision making.
- In this algorithm, we can check all possible results.
- The working process of this algorithm is more or less same as the working process of our brain while decision making. So, human can understand the logic easily.
- Here, in this algorithm, we need less cleaning of data.
Disadvantages of the Decision Tree
- If we have many class labels then the decision tree may become very complex.
- The decision tree algorithm may face over fitting issues. To solve this problem, we can use random forest algorithm.
- As we have seen in previous examples and implementations, the decision tree has many layers. So, it can be a little bit complex.