ML Cancer Cell Classification Using Scikit-learn
Before building the cancer cell classifier with Scikit-learn, let us first review what machine learning is. Machine learning is a subfield of artificial intelligence (AI). Rather than explicitly programming computers to perform a task, it focuses on teaching them to learn from data and improve over time. In machine learning, algorithms search through large amounts of data for patterns and correlations, then use what they find to make decisions and predictions. Applications that use machine learning improve and become more accurate as they gain access to more data.
How Machine Learning works:
Machine learning comprises different kinds of models that employ different algorithmic strategies. Depending on the type of data and the desired result, one of four learning models can be used: supervised, unsupervised, semi-supervised, or reinforcement. Within each of those models, one or more algorithmic techniques may be applied, depending on the data sets in use and the intended results. In essence, machine learning algorithms are designed to classify objects, find patterns, forecast outcomes, and draw conclusions. Algorithms can be used one at a time or combined to achieve the best possible accuracy when the data involved is complex and unpredictable.
How does Supervised Learning work?
The first of the four machine learning models is supervised learning. Supervised learning algorithms learn by example. Supervised learning models are trained on "input"/"output" data pairs, where the output is labelled with the desired value. Imagine, for example, that the goal is for the machine to distinguish between daisies and pansies. One binary input data pair includes both an image of a daisy and an image of a pansy. The desired result for that pair is to pick the daisy, so it is pre-identified as the correct answer.
Over time, the algorithm works through all of this training data and begins to identify correlated similarities, differences, and other points of logic, until it can anticipate the answers to daisy-or-pansy questions on its own. It is like handing a child a set of problems with an answer key and asking them to show their work and explain their reasoning. Supervised learning models power many of the programs we use daily, such as product recommendation engines and traffic-monitoring apps like Waze that predict the fastest route at different times of day.
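The daisy-or-pansy idea can be sketched in a few lines of scikit-learn. The petal measurements below are invented toy numbers, not real botany; the point is only that the model is trained on labelled input/output pairs and then answers on its own.

```python
# A minimal supervised-learning sketch: the model sees labelled examples
# and learns to map new, unseen inputs to labels.
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical petal lengths (cm): daisies short, pansies longer.
X_train = [[1.0], [1.2], [1.1], [3.0], [3.2], [2.9]]
y_train = ["daisy", "daisy", "daisy", "pansy", "pansy", "pansy"]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)           # learn from the "answer key"

print(clf.predict([[1.1], [3.1]]))  # classify two unseen flowers
```

Any classifier would do here; a k-nearest-neighbours model is used only because it mirrors the "find correlated similarities" intuition described above.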
How does Unsupervised Learning work?
The second of the four machine learning models is unsupervised learning. Unsupervised learning models have no answer key. The computer analyses the input data, much of which is unlabelled and unstructured, and begins to identify patterns and relationships using all the relevant data available. Unsupervised learning is often compared to the way humans view the world: we group things together using experience and intuition, and our ability to classify and identify something becomes more accurate the more examples of it we encounter. For machines, "experience" is defined by the amount of data that is input and made available. Common applications of unsupervised learning include market research, cybersecurity, DNA sequence analysis, and facial recognition.
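A minimal sketch of this pattern-finding without labels, using scikit-learn's KMeans on made-up two-dimensional points (the coordinates are illustrative only):

```python
# Unsupervised-learning sketch: no labels are supplied; KMeans groups
# the points purely by their similarity to one another.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # two groups discovered without any answer key
```

The algorithm is told only how many groups to look for, not what any group means; it discovers the two clumps on its own.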
What does "Semi-Supervised Learning" involve?
The third of the four machine learning models is semi-supervised learning. In an ideal world, all data would be structured and labelled before being entered into a system. Since that is obviously not feasible, semi-supervised learning becomes useful when there are large amounts of raw, unstructured data. In this model, small amounts of labelled data are fed in to supplement sets of unlabelled data. The labelled data essentially gives the algorithm a head start and can significantly improve learning speed and accuracy. A semi-supervised learning technique instructs the machine to analyse the labelled data for correlating properties that can be applied to the unlabelled data.
There are, however, risks involved with this model, where errors in the labelled data get learned and repeated by the system, as detailed in this MIT Press research article. Organisations that employ semi-supervised learning most successfully make sure that best-practice guidelines are in place. Semi-supervised learning is used in sophisticated medical research, such as protein categorisation, in high-level fraud detection, and in speech and linguistic analysis.
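As a hedged sketch of the idea (the numbers are made up, and LabelPropagation stands in for whichever semi-supervised method a real project would choose), scikit-learn's semi_supervised module can spread a few known labels across unlabelled points:

```python
# Semi-supervised sketch: a handful of labelled points (0/1) plus points
# marked -1 ("unlabelled"); LabelPropagation infers labels for the rest.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[0.0], [0.2], [4.0], [4.2], [0.1], [4.1], [0.3], [3.9]])
y = np.array([0, 0, 1, 1, -1, -1, -1, -1])  # -1 means "no label yet"

model = LabelPropagation().fit(X, y)
print(model.transduction_)            # labels inferred for every point
print(model.predict([[0.15], [4.05]]))
```

The four labelled points act as the "head start" described above; the algorithm assigns each unlabelled point the label of the group it most resembles.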
Reinforcement Learning: What is it?
The fourth machine learning model is reinforcement learning. Whereas in supervised learning the machine receives the answer key and learns by finding correlations among all the correct answers, the reinforcement learning model has no answer key; instead, a list of allowable actions, rules, and potential end states is provided. When the algorithm's desired goal is fixed or binary, machines can learn by example. But when the desired outcome is mutable, the system must learn through experience and reward. In reinforcement learning models, the "reward" takes the form of a numerical value that is programmed into the algorithm as something the system seeks to acquire.
In many ways, this model is analogous to teaching someone to play chess. It would certainly be impossible to show them every feasible move. Instead, you lay out the rules, and they hone their skill through practice. Rewards come in the form of winning the game as well as capturing the opponent's pieces. Applications of reinforcement learning include computer game development, high-stakes stock market trading, and automated price bidding for buyers of online advertising.
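Reward-driven learning can be sketched with a tiny tabular Q-learning example. Everything here, the five-cell world, the reward of 1.0, the hyperparameters, is illustrative, not from the article: an agent learns purely from trial and reward that moving right reaches the goal.

```python
# Minimal Q-learning sketch: no answer key, only rules (allowed moves)
# and a reward programmed in for reaching the goal state.
import random

random.seed(0)
n_states = 5                    # cells 0..4 on a one-dimensional strip
actions = [-1, +1]              # move left / move right
goal = 4                        # reaching cell 4 yields the reward
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

# Q-table: estimated future reward for each (state, action) pair
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for episode in range(200):
    s = 0
    while s != goal:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s_next == goal else 0.0
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# The learned greedy policy: in every non-goal cell, move right (+1)
policy = [max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)]
print(policy)
```

After enough episodes of experience, the reward signal alone is sufficient for the agent to prefer the right-moving action in every cell.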
Up to now we have seen what machine learning is and how it works; next, we will see how machine learning can be used to build a cancer cell classifier with scikit-learn.
Machine learning can be used to solve many real-world problems.
Let us classify cancer cells based on their features and determine whether they are "malignant" or "benign". We will solve this machine learning problem with Scikit-learn, a Python package for machine learning, data mining, and data analysis.
The dataset that will be used with Scikit-learn is:
Scikit-learn ships with a few small standard datasets, so there is no need to download any files from external websites. For our machine learning task we will use the Breast cancer Wisconsin (diagnostic) dataset. The dataset contains various measurements of breast cancer tumours, along with labels indicating whether each tumour is malignant or benign. It can be loaded with the following function:
load_breast_cancer([return_X_y])
The dataset contains 569 instances (data on 569 tumours) and includes information on 30 attributes, or features, such as a tumour's radius, texture, perimeter, and area. We will use these features to train our model.
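Since this dataset ships with scikit-learn, the stated size can be verified directly before we begin:

```python
# Load the bundled Breast cancer Wisconsin dataset and confirm its size:
# 569 tumour samples, each described by 30 numeric features.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.data.shape)         # (569, 30)
print(data.feature_names[:4])  # a first look at the feature names
```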
Installing the required modules:
This machine learning project requires the Python module "scikit-learn", which provides the machine learning techniques we need. If it is not already installed on your computer, run the following command at the command prompt to download and install it:
pip3 install scikit-learn
pip is Python's package installer; we use it to download and install packages such as scikit-learn so that we can perform the required operations. (In a Jupyter notebook, prefix the command with "!" to run it from a cell.)
Note: You can use any IDE for this project, although Jupyter Notebook is strongly recommended. Because Python is an interpreted language, you can make full use of it by running a few lines of code at a time and observing what happens step by step, instead of writing the whole script at once and running it in one go. Working in a Jupyter notebook therefore gives a clear, step-by-step understanding of everything we do.
Step 1: Run the following line at the command prompt to install Jupyter:
pip install jupyter
This installs Jupyter so that we can run our code in notebooks.
Now let us walk through, step by step, how scikit-learn is used in Python to classify the dataset.
# First, import the scikit-learn module
import sklearn
# Import the dataset loader for the classification task
from sklearn.datasets import load_breast_cancer
Step 2: Now we will load the dataset into a variable.
# load the dataset into a variable named data_info
data_info = load_breast_cancer()
The crucial attributes of this dataset that we must consider are 'target_names' (the meaning of the labels), 'target' (the classification labels), 'feature_names' (the meaning of the features), and 'data' (the data to learn from).
Step 3: Organising and examining the data.
Let us first organise the data and then use the print() function to view it, so that we gain a better grasp of what the dataset contains and how we can use it to train our model.
# Organize our data
label_names = data_info['target_names']
labels = data_info['target']
feature_names = data_info['feature_names']
features = data_info['data']
Now we will use the print() function to see what data items are available in the dataset.
# looking at the data present
print(label_names)
Output:
['malignant' 'benign']
We can now see that every data item is labelled either malignant or benign.
Next we will print the label values to see the labels present in the dataset:
print(labels)
Output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 0 0 0 0 0 0 1]
From this output we can see that the labels are binary values of zero and one, where 0 represents malignant tumours and 1 represents benign tumours.
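As a quick sanity check (an extra step not shown in the walkthrough above), the labels can be tallied with np.bincount; this dataset contains 212 malignant and 357 benign tumours:

```python
# Count how many tumours fall into each class: index 0 is malignant,
# index 1 is benign in this dataset's encoding.
import numpy as np
from sklearn.datasets import load_breast_cancer

data_info = load_breast_cancer()
counts = np.bincount(data_info['target'])
print(counts)  # [212 357] -> 212 malignant, 357 benign
```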
Step 4: Splitting the data into sets.
To evaluate how well our classifier performs, we must test the model on unseen data. Therefore, before building the model, we will split our data into two sets: a training set and a test set. We will train the model on the training data and then use the trained model to make predictions on the test data.
The train_test_split() function, which is part of the sklearn package, automatically divides the data into these sets. We will use this function to split the data.
# importing the function
from sklearn.model_selection import train_test_split
# splitting the data
train, test, train_labels, test_labels = train_test_split(features, labels, test_size=0.33, random_state=42)
The train_test_split() function randomly partitions the data according to the test_size parameter. Here we have set aside 33% of the original data as test data (test); the remaining data is the training data (train). We also have labels for both sets, named train_labels and test_labels respectively.
You can consult the official documentation for more information on how to use the train_test_split() function.
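As a quick check of the split described above (this verification step is an addition, not part of the original walkthrough), the resulting set sizes can be printed; with test_size=0.33, sklearn holds out ceil(0.33 × 569) = 188 samples for testing and keeps 381 for training:

```python
# Reproduce the split and confirm the sizes of the two sets.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data_info = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data_info['data'], data_info['target'],
    test_size=0.33, random_state=42)
print(len(train), len(test))  # 381 188
```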
Step 5: Building the Model
There are numerous machine learning models available, each with its own strengths and drawbacks. For this model we will use the Naive Bayes algorithm, which often excels at binary classification tasks. First, import the GaussianNB module and initialise it by calling GaussianNB(). Then train the model by fitting it to the data in the dataset using the fit() method.
# importing the Naive Bayes classifier from sklearn
from sklearn.naive_bayes import GaussianNB
# initializing the classifier, to be trained in the next step
gnb = GaussianNB()
# training the classifier
model = gnb.fit(train, train_labels)
After training, we can use the trained model to make predictions on the test set we prepared earlier. We will use the built-in predict() method, which returns an array of predictions, one for each data instance in the test set. Then we will print our predictions with the print() function.
# making predictions on the test data using the trained classifier
predictions = gnb.predict(test)
# now print the prediction values
print(predictions)
Output:
[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
0 1 1 ]
From the above output we can see that the function has returned an array of zeros and ones; from these values we can tell, for each test sample, whether the model predicts a malignant or a benign tumour.
Step 6: Measuring the accuracy of the trained model.
Now that we have predicted values, we can assess the accuracy of our model by comparing the predictions with the actual labels of the test set, that is, by comparing predictions with test_labels. For this purpose we will use the accuracy_score() function that is included in the sklearn module.
# importing the accuracy measuring function
from sklearn.metrics import accuracy_score
# evaluating the accuracy
print(accuracy_score(test_labels, predictions))
Output:
0.9414893617021277
It turns out that this machine learning classifier, based on the Naive Bayes algorithm, predicts whether a tumour is malignant or benign with 94.15% accuracy.
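As an optional follow-up not covered in the tutorial above, a confusion matrix breaks the single accuracy figure down by outcome type (rows are true classes, columns are predicted classes), showing how many malignant tumours were missed versus how many benign tumours were flagged:

```python
# Rebuild the same pipeline and inspect the confusion matrix
# alongside the accuracy score.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

data_info = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data_info['data'], data_info['target'],
    test_size=0.33, random_state=42)

model = GaussianNB().fit(train, train_labels)
predictions = model.predict(test)

print(confusion_matrix(test_labels, predictions))
print(accuracy_score(test_labels, predictions))  # 0.9414893617021277
```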