Python Tutorial

Introduction Python Features Python Applications Python System requirements Python Installation Python Examples Python Basics Python Indentation Python Variables Python Data Types Python IDE Python Keywords Python Operators Python Comments Python Pass Statement

Python Conditional Statements

Python if Statement Python elif Statement Python If-else statement Python Switch Case

Python Loops

Python for loop Python while loop Python Break Statement Python Continue Statement Python Goto Statement

Python Arrays

Python Array Python Matrix

Python Strings

Python Strings Python Regex

Python Built-in Data Structure

Python Lists Python Tuples Python Lists vs Tuples Python Dictionary Python Sets

Python Functions

Python Function Python min() function Python max() function Python User-define Functions Python Built-in Functions Python Recursion Anonymous/Lambda Function in Python apply() function in python Python lambda() Function

Python File Handling

Python File Handling Python Read CSV Python Write CSV Python Read Excel Python Write Excel Python Read Text File Python Write Text File Read JSON File in Python

Python Exception Handling

Python Exception Handling Python Errors and exceptions Python Assert

Python OOPs Concept

OOPs Concepts in Python Classes & Objects in Python Inheritance in Python Polymorphism in Python Python Encapsulation Python Constructor Python Super function Python Static Method Static Variables in Python Abstraction in Python

Python Iterators

Iterators in Python Yield Statement In Python Python Yield vs Return

Python Generators

Python Generator

Python Decorators

Python Decorator

Python Functions and Methods

Python Built-in Functions Python String Methods Python List Methods Python Dictionary Methods Python Tuple Methods Python Set Methods

Python Modules

Python Modules Python Datetime Module Python Math Module Python Import Module Python Time ModulePython Random Module Python Calendar Module CSV Module in Python Python Subprocess Module

Python MySQL

Python MySQL Python MySQL Client Update Operation Delete Operation Database Connection Creating new Database using Python MySQL Creating Tables Performing Transactions

Python MongoDB

Python MongoDB

Python SQLite

Python SQLite

Python Data Structure Implementation

Python Stack Python Queue Python Linked List Python Hash Table Python Graph

Python Advance Topics

Speech Recognition in Python Face Recognition in Python Python Linear regression Python Rest API Python Command Line Arguments Python JSON Python Subprocess Python Virtual Environment Type Casting in Python Python Collections Python Attributes Python Commands Python Data Visualization Python Debugger Python DefaultDict Python Enumerate

Python 2

What is Python 2

Python 3

Anaconda in Python 3 Anaconda python 3 installation for windows 10 List Comprehension in Python3

How to

How to Parse JSON in Python How to Pass a list as an Argument in Python How to Install Numpy in PyCharm How to set up a proxy using selenium in python How to create a login page in python How to make API calls in Python How to run Python code from the command prompt How to read data from com port in python How to Read html page in python How to Substring a String in Python How to Iterate through a Dictionary in Python How to convert integer to float in Python How to reverse a string in Python How to take input in Python How to install Python in Windows How to install Python in Ubuntu How to install PIP in Python How to call a function in Python How to download Python How to comment multiple lines in Python How to create a file in Python How to create a list in Python How to declare array in Python How to clear screen in Python How to convert string to list in Python How to take multiple inputs in Python How to write a program in Python How to compare two strings in Python How to create a dictionary in Python How to create an array in Python How to update Python How to compare two lists in Python How to concatenate two strings in Python How to print pattern in Python How to check data type in python How to slice a list in python How to implement classifiers in Python How To Print Colored Text in Python How to open a file in python How to Open a file in python with Path How to run a Python file in CMD How to change the names of Columns in Python How to Concat two Dataframes in Python How to Iterate a List in Python How to learn python Online How to Make an App with Python How to develop a game in python How to print in same line in python How to create a class in python How to find square root in python How to import numy in python How to import pandas in python How to uninstall python How to upgrade PIP in python How to append a string in python How to comment out a block of code in Python How to change a value of a tuple in Python

Sorting

Python Sort List Sort Dictionary in Python Python sort() function Python Bubble Sort

Programs

Factorial Program in Python Prime Number Program in Python Fibonacci Series Program in Python Leap Year Program in Python Palindrome Program in Python Check Palindrome In Python Calculator Program in Python Armstrong Number Program in Python Python Program to add two numbers Anagram Program in Python Number Pattern Programs in Python Even Odd Program in Python GCD Program in Python Python Exit Program Python Program to check Leap Year Operator Overloading in Python Pointers in Python Python Not Equal Operator Raise Exception in Python Salary of Python Developers in India What is a Script in Python

Misc

Introduction to Scratch programming SKLearn Clustering SKLearn Linear Module Standard Scaler in SKLearn Python Time Library SKLearn Model Selection Standard Scaler in SKLearn Accuracy_score Function in Sklearn Append key Value to Dictionary in Python Cross Entropy in Python Cursor in Python Data Class in Python How to Install Tweepy in Python Imread Python Program of Cumulative Sum in Python Python Program for Linear Search Python Program to Generate a Random String Read numpy array in Python Scrimba python Sklearn linear Model in Python Scraping data in python Accessing Key-value in Dictionary in Python Find Median of List in Python Linear Regression using Sklearn with Example Problem-solving with algorithm and data structures using Python Python 2.7 data structures Python Variable Scope with Local & Non-local Examples Arguments and parameters in Python Assertion error in python Programs for Printing Pyramid Patterns in Python _name_ in Python Amazon rekognition using python Anaconda python 3.7 download for windows 10 64-bit Android apps for coding in python Augmented reality in python Best app for python Difference between Perl and Python Not supported between instances of str and int in python Python comment symbol Python Complex Class Python IDE names Selection Sort Using Python Hypothesis Testing in Python Idle python download for Windows Insertion Sort using Python Merge Sort using Python Python - Binomial Distribution Python Logistic Regression with Sklearn & Scikit Python Random shuffle() method Python variance() function Python vs HTML Removing the First Character from the String in Python Adding item to a python dictionary Best books for NLP with Python Best Database for Python Count Number of Keys in Dictionary Python Cross Validation in Sklearn Drop() Function in Python EDA in Python Excel Automation with Python Python Program to Find the gcd of Two Numbers Python Web Development projects Adding a key-value pair to dictionary in Python Python Euclidean Distance Python Filter List Python Fit Transform Python e-book free download Python email utils Python range() Function Python random.seed() function What is the re.sub() function in Python Python PPTX Python Pickle Python Seaborn Python Coroutine Python EOL Python Infinity Python math.cos and math.acos function Python Project Ideas Based On Django Reverse a String in Python Reverse a Number in Python Python Word Tokenizer Python Trigonometric Functions Python try catch exception GUI Calculator in Python Implementing geometric shapes into the game in python Installing Packages in Python Python Try Except Python Sending Email Socket Programming in Python Python CGI Programming Python Data Structures Python abstract class Python Compiler Python K-Means Clustering NSE Tools In Python Operator Module In Python Palindrome In Python Permutations in Python Pillow Python introduction and setup Python Functionalities of Pillow Module Python Argmin Python whois Python JSON Schema Python lock Return Statement In Python Reverse a sentence In Python tell() function in Python Why learn Python? Write Dictionary to CSV in Python Write a String in Python Binary Search Visualization using Pygame in Python Latest Project Ideas using Python 2022 Closest Pair of Points in Python ComboBox in Python Python vs R Best resources to learn Numpy and Pandas in python Check Letter in a String Python Python Console Python Control Statements Convert Float to Int in Python using Pandas Difference between python list and tuple Importing Numpy in Pycharm Python Key Error Python NewLine Python tokens and character set Python Strong Number any() Keyword in python Best Database in Python Check whether dir is empty or not in python Comments in the Python Programming Language Convert int to Float in Python using Pandas Decision Tree Classification in Python End Parameter in python __GETITEM__ and __SETITEM__ in Python Python Namespace Python GUI Programming List Assignment Index out of Range in Python List Iteration in Python List Index out of Range Python for Loop List Subtract in Python Python Empty Tuple Python Escape Characters Sentence to python vector Slicing of a String in Python Executing Shell Commands in Python Genetic Algorithm in python Get index of element in array in python Looping through Data Frame in Python Syntax of Map function in Python After Python What Should I Learn Python AIOHTTP Alexa Python Artificial intelligence mini projects ideas in python Artificial intelligence mini projects with source code in Python Find whether the given stringnumber is palindrome or not First Unique Character in a String Python Python Network Programming Python Interface Python Multithreading Python Interpreter Data Distribution in python Flutter with tensor flow in python Front end in python Iterate a Dictionary in Python Iterate a Dictionary in Python – Part 2 Allocate a minimum number of pages in python Assertion Errors and Attribute Errors in Python Checking whether a String Contains a Set of Characters in python Python Control Flow Statements *Args and **Kwargs in Python Bar Plot in Python Conditional Expressions in Python Function annotations() in Python How to Write a Configuration file in Python Image to Text in python import() Function in Python Import py file in Python Multiple Linear Regression using Python Nested Tuple in Python Python String Negative Indexing Reading a File Line by Line in Python Python Comment Block Base Case in Recursive function python ER diagram of the Bank Management System in python Image to NumPy Arrays in Python NOT IN operator in Python One Liner If-Else Statements in Python Sklearn in Python Cube Root in Python Python Variables, Constants and Literals What Does the Percent Sign (%) Mean in Python Creating Web Application in python Notepad++ For Python PyPi TensorFlow Python | Read csv using pandas.read_csv() What is online python free IDE What is Python online compiler Run exec python from PHP What are the Purposes of Python What is Python compiler GDB Python coding platform Python Classification Python | a += b is not always a = a + b PyDev with Python IDE Character Set in Python Best Python AI Projects _dict_ in Python Python Ternary Operators Self in Python Python vs Java Python Modulo Python Packages Python Syntax Python Uses Python Bitwise Operators Python Identifiers Python Matrix Multiplication Python AND Operator Python Logical Operators Python Multiprocessing Python Unit Testing __init__ in Python Advantages of Python Is Python Case-sensitive when Dealing with Identifiers Python Boolean Python Call Function Python History Python Image Processing Python main() function Python Permutations and Combinations Difference between Input() and raw_input() functions in Python Conditional Statements in python Confusion Matrix Visualization Python Nested List in Python Python Algorithms Python Modules List Difference between Python 2 and Python 3 Is Python Case Sensitive Method Overloading in Python Python Arithmetic Operators Assignment Operators in Python Is Python Object Oriented Programming language Python Division Python exit commands Continue And Pass Statements In Python Colors In Python Convert String Into Int In Python Convert String To Binary In Python Convert Uppercase To Lowercase In Python Convert XML To JSON In Python Converting Set To List In Python Covariance In Python CSV Module In Python Decision Tree In Python Difference Between Yield And Return In Python Dynamic Typing In Python BOTTLE Python Web Framework How to Install Scikit-Learn Introducing modern python computing in simple packages Python vs PHP Reason for Python So Popular Returning Multiple Values in Python Spotify API in Python Spyder (32-bit) - Free download Time. Sleep() in Python Traverse Dictionary in Python What is Ipython shell YOLO Python Nested for Loop in Python Data Structures and Algorithms Using Python | Part 1 Data Structures and Algorithms using Python | Part 2 ModuleNotFoundError No module named 'mysql' in Python N2 in Python XGBoost for Regression in Python

Data Distribution in python

Before discussing data distribution, we will discuss each word that is data and Distribution.

What is meant by data?

  1. A collection of raw facts and figures is called data. The word "raw" means that the facts are unprocessed.
  2. Data is collected from different sources. It is collected for different purposes.
  3. Data may consist of numbers, characters, symbols or pictures etc. 

Example of data

Students fill out an admission form when they get college admission. The form consists of raw facts about the students. These raw facts are the student's name, father's name, address etc. The main purpose of collecting the data is to maintain the records of the students during their study period in the college or any educational institution.

What kinds of data exist?

  1. Property records
  2. Census records
  3. Animal control records
  4. Test scores
  5. Arrest records
  6. Auto accident records
  7. Budgets
  8. Criminal/court records and more!

What is meant by Distribution?

In a programming language, the word distribution means the phase which will follow some packages. The package will be on any distribution medium, such as a compact disk, or maybe it is located on a server where the customers can download it electronically.

In python, Distribution is a collection that contains an implementation of python along with a group of libraries or tools.

The best python distributions are Anaconda Python, PyPy, Jython, Active Python, CPython etc.

Now, we will learn about Data Distribution.

Data Distribution

In the real world, the data sets are big, and it is more difficult to gather this real-world data at the early stage of a given project.

We get doubt about how can we get this type of data sets.

For creating the big sets for testing, we use modules in python like NumPy that contain many methods to create random data sets that can be of any size. Let us discuss an example of this.

Example

Let us create an array containing floating-point numbers containing 50 between 0 and 1.

import numpy
x = numpy.random.uniform(0.0, 1.0, 50)
print(x)


Output

Data Distribution in Python

The above output shows that the floating-point numbers are displayed between 0 and 1. The total number of floating-point numbers is 50, as we have given as an input.

Now, we will discuss "Histogram".

What is Histogram?

A histogram is a bar graph that shows data in intervals. It is also called the graphical representation of a frequency distribution (in complete form) in a rectangle with class intervals as bases and the corresponding frequencies as heights. There should be no gap between any two successive rectangles. It is also fine when there is a gap.

It has adjacent bars at intervals. The histogram shown here illustrates the Distribution of weights in KG of 40 persons of a locality.

In other words, we can draw a histogram with the data we collected to visualize the data set. 

Here, we will use the python module Matplotlib to draw a histogram.

Let us check an example showing a histogram.

Example program

import numpy
import matplotlib.pyplot as plt
x = numpy.random.uniform(0.0, 5.0, 250)
plt.hist(x, 5)
plt.show()

Output

Data Distribution in Python

From the above histogram, we use the array from the example for drawing the histogram with 5 bars.

The first bar will represent how many values are in the array between 0 and 1.

The second bar represents how many values are in the array between 1 and 2.

The third bar represents how many values are in the array between 2 and 3.

And so on.

From this, the result will be as follows:

  1. There are 52 values between 0 and 1
  2.  There are 48 values between 1 and 2
  3. There are 49 values between 2 and 3
  4. There are 51 values between 3 and 4
  5. There are 50 values between 4 and 5

NOTE: From the histogram, the array values are random numbers, and also, it will not show the exact result on our system.

There are many types of distributions, such as:

  1. Discrete Distribution
  2. Geometric Distribution
  3. Continuous Distribution
  4. Lognormal Distribution
  5. Weibull Distribution
  6. Non-normal Distribution

We will discuss these types later, and now let us know how to evaluate data distribution.

We all observe that data distribution is generated by its number of peaks, uniformity, symmetry possession, and skewness. Skewness means the measure of the lack of symmetry in data distribution.

We discussed the definition of data and distribution and data distribution and learned about a random array of a given size and between two given values.

Here in this article, we will learn how to create an array in which the values concentrate only on a given value.

In the theory of probability, this type of distribution is called the normal data distribution, or it is also called Gaussian data distribution. This name arrived from a scientist called Carl Friedrich, who devised a formula for this data distribution.

An example of this normal data distribution is as follows:

# the three lines that will make the output able to draw
import sys
import matplotlib
matplotlib.use('Agg')
import numpy
import matplotlib.pyplot as plt
x = numpy.random.normal(5.0, 1.0, 100000)
plt.hist(x, 100)
plt.show()
#the two lines will make the compiler to draw
plit.savefig(sys.stdout.buffer)
sys.stdout.flush()


Output

Data Distribution in Python

A normal data distribution graph is also known as a "bell curve" because it is in the shape of a bell.

From the above program, we have used the "NumPy. random. normal ()" method, giving input with 100000 values, which will draw a histogram with 100 bars.

We also specified that the mean value is 5.0 and the standard deviation is 1.0. It means that the values must concentrate only around 5.0, and it will rarely cross the value of the mean, which is 1.0.

The above histogram shows that most values are between 4.0 and 6.0, and the top value is 5.0.

Now let us discuss a topic called "Scatter Plot".

Scatter Plot

The scatter plot is a diagram where a dot will represent every value belonging to the data set.

Data Distribution in Python

In python, a matplotlib module method is used for drawing these scatter plots. For this, we need two arrays of the same length, one is for values on the x-axis, and the other array is for the values on the y-axis.

For example, let us take two arrays named x and y

x = [4,7,6,8,16,11,12,9,8,6,5]
y = [55,34,67,89,90,79,88,96,100,99,87]

 The array "x" will represent the car age of every car.

The array "y" will represent the speed of every car.

Let us take an example using the scatter () method for drawing scatter plot diagrams.

An example program showing scatter () method

import matplotlib.pyplot as plt
x = [4,13,6,8,16,11,12,9,8,6,5]
y = [55,34,67,89,90,79,88,96,100,99,87]
plt.scatter(x,y)
plt.show()

Output

Data Distribution in Python

Explanation of the program

The x-axis will represent the car's age, and the y-axis will represent the car's speed.

We observed from the diagram that the fastest speed of the car is 100, and the age of it is 8 years, and the slowest car is 7, and its age is 13.

It shows that the new car's speed is more than the old car and there will also be a coincidence, and we only registered 11 cars.

Random Data Distributions

In python, the data sets contain thousands or millions of values.

We might not have the real-world data when we test an algorithm; for this, we have to generate randomly generated values.

We learned the above about the NumPy module; it will help us with that.

So, let us create two arrays filled with 500 random numbers from a normal data distribution.

The first and foremost array will have the set of the mean is 3.0 having a standard deviation of 1.0.

The second array consists of the mean set is 4.0 and has a standard deviation of 2.0.

import numpy
importmatplotlib. pyplot as plt
x = numpy.random.normal(3.0, 1.0, 500)
y = numpy.random.normal(4.0, 2.0, 500)
plt.scatter(x, y)
plt.show()


Output

Data Distribution in Python

Explanation of Scatter Plot

From the above graph, we can see that the program of scatter plot results in a graph consisting of dots, which are results around the value "3" on the x-axis and value "4" on the y-axis and these values are given as input from us. We also observed that the spread of dots or result of dots is more on the y-axis compared to the axis x.

Now, we will learn a new concept called "Regression".

Regression

The term regression will be used when we are trying to find the relation among variables.

In python, machine learning and statistical modelling, this relationship is used to predict future events' results.

Here in this regression, there are many types:

  1. Linear Regression
  2. Polynomial Regression
  3. Multiple Regression

In python, regression is a form of predictive modelling technique which investigates the relationship between a dependent and an independent variable.

It involves graphing a line over a set of data points that most closely fits the overall shape of the data. The regression shows the changes in a dependent variable on the y-axis and the explanatory variable on the x-axis.

Uses of Regression

  1. Determining the strength of predictors. Here, it is used to determine the strength of the independent variables' effect on the dependent variables.
  2. Forecasting an effect, and in this, the regression is used to forecast the effects or impact of changes in one or more independent variables.
  3. Trend forecasting the regression analysis is used to predict trends and future values.

Types of Regression

There are three types of regression, such as:

  1. Linear Regression: In simple terms, we know and are interested in y=mx+c, which is the equation of a straight line. It explains the correlation between 'x' and 'y' variables. It means every value of 'x' has a corresponding value of 'y' if it is continuous.
    In linear regression, the data is modelled using a straight line. It uses the relation between the data points to draw a straight line through them. The result produced as a straight line is used to predict the future values.
Data Distribution in Python

In python, predicting the values of the future is very important.

Now, we will learn how it will work.

How does It Work?

In python, many methods and libraries for finding the relationship between data points are used to draw a linear regression line.

We will now discuss using these methods instead of a mathematical formula.

Linear regression is used with continuous variables. The output or linear regression prediction is the variable's value. Here, we use measured by loss, R squared, Adjusted R squared etc. are used for accuracy and goodness of fit.

Selection Criteria

  • Classification and Regression Capabilities
  • Regression models predict a continuous variable such as sales made on a day or a city's temperature. Reigning on a polynomial like a straight line to fit a data set poses a real challenge in building a classification capability.
  • For example, we fit a line with ten points that we have and now imagine if we add some more data points to it. In order o fit it, we have to change our existing model; that is, we have to change the threshold itself.
  • Hence, the linear model is not good for classification models.

Data Quality

Each missing value removes one data point that could optimize the regression. In simple linear regression, the outliers can significantly disrupt the outcome. If you remove the outliers, your model will become very good.

Computational Complexity

Linear regression is often not computationally expensive compared to the decision tree or the clustering algorithm. The order of complexity for 'n' training example and "X" features usually falls in either BigO(X)^2 or BigO(Xn).

Comprehensible and Transparent

The linear regression is easily understandable and transparent. A simple mathematical notation can represent by anyone and can be understood very easily.

So, these are some criteria based on which we will select the linear regression algorithm.

Where is Linear Regression used?

1. Evaluating trends and sales estimates.

 Linear regression can be used in business to evaluate trends and make estimates or focus.

Suppose a company's sales have increased steadily every month for the past few years. In that case, conducting a linear analysis of the sales data with monthly sales on the y-axis and time on the x-axis will give us a line predicting the upwards trends in the sale; after creating the trend line, the company could use the slope of the lines to focus sale in coming months.

2. Analyzing the Impact of Price Changes

Linear regression can be used to analyze the effect of pricing on consumer behaviour. For instance, if a company changes the price of a certain product several times, then it can record the quantity itself or each price level and then perform a linear regression with sold quantity as a dependent variable and price as the dependent variable. It will result in a line that depicts the extent to which they reduce their product consumption as the price increases.

Polynomial Regression

It is opposite to the linear regression that we discussed previously. When the points of our data do not fit a linear regression that is a straight line through all the data points, it will be a curve or ideal for polynomial regression.

Like linear regression, polynomial regression also uses a relationship between the variables a and b that will find a good way, and we can draw a line through the data points.

How does it work?


We know that python has many methods for finding a relationship between data points and drawing a polynomial regression line.

Here, we will discuss how it works and how to use these python methods apart from using a mathematical formula.

Now, let us discuss an example that shows 13 cars as they are passing through a tollbooth. We have already registered those cars' speed and the time of hours for which those cars are passed.

For drawing the polynomial regression graph, the x-axis represents the time of hours, and the y-axis represents the speed of those cars.

So, let us start by drawing a scatter plot.

import matplotlib.pyplot as plt
x = [4,13,6,8,16,11,12,9,8,6,5]
y = [55,34,67,89,90,79,88,96,100,99,87]
plt.scatter(x,y)
plt.show()



Result

Data Distribution in Python

By drawing a scatter plot graph, we get an idea of how to draw the graph for polynomial regression.

At first, we should import NumPy and matplotliband then draw the line of polynomial regression.

import numpy
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
b = [87, 78, 90, 100, 34, 23, 78, 67, 98, 67, 88, 32, 23]
myexample = numpy.poly1d(numpy.polyfit(a, b, 3))
myline = numpy.linspace(3, 24, 99)
plt.scatter(a, b)
plt.plot(myline, mymodel(myline))
plt.show()

Result

Data Distribution in Python

We need to import the modules.

import numpy
import matplotlib.pyplot as plt


After importing the modules, we need to represent the arrays and the array values of the x and y-axis.

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
b = [87, 78, 90, 100, 34, 23, 78, 67, 98, 67, 88, 32, 23]


To make a polynomial model, NumPy has a method.

myexample = numpy.poly1d(numpy.polyfit(a, b, 3))


Now, we can specify the line like it displays, so we start at position 3 and end at position 24.

myline = numpy.linspace(3, 24, 99)

Now, we can draw the original scatter plots.

plt.scatter(a, b)


After drawing the scatter plots then, draw the line of polynomial regression.

plt.plot(myline, mymodel(myline))

All the steps are completed, and now we can display the diagram by using:

plt.show()

By learning all these types of regressions, we get doubt about how the relationship is measured. So, let us discuss how this relationship is measured and predict anything.

R-Squared

As discussed above, it is important to know the relationship between the values of the x and y-axis; if there is no relationship between these axes, the polynomial regression will be unable to be used for predicting anything.

So, this relationship is measured with a value called r-squared.

The values of r-squared values range from 0 to 1, where "0" means no relationship and "1" means totally (100%) related.

Python programming language and SkLearn module will compute this value for us, and all we have to do is keep the x and y arrays and feed them.

import numpy
from sklearn.metrics import r2_score
x = [4,13,6,8,16,11,12,9,8,6,5,3,6]
y = [55,34,67,89,90,79,88,96,100,99,87,49,76]
myexample = numpy.poly1d(numpy.polyfit(x, y, 3))
print(r2_score(y, myexample(x)))

Result

Data Distribution in Python

Note: The result we got is 0.49, which shows an average relationship, and there should be more value in a relationship for future predictions.

Predicting future values

We all gathered the information to predict the future values, and now we can use this information to predict future values.

Let us consider an example: Let us now try to predict the car's speed that will pass the tollbooth at around 15P.M:

To do this calculation, we need an array called "myexample" from the example above.

myexample = numpy.poly1d(numpy.polyfit(x, y, 3))

Example

We are predicting the speed of a car that is passing at 15P.M:

import numpy
from sklearn.metrics import r2_score
a = [4,13,6,8,16,11,12,9,8,6,5,3,6]
b = [55,34,67,89,90,79,88,96,100,99,87,49,76]
myexample = numpy.poly1d(numpy.polyfit(a, b, 4))
speed = myexample(15)
print(speed)

Result

Data Distribution in Python

From the above code, we got the result of 58.7, which is the speed of the car passing at "15P.M".

It is not always possible to create a perfect polynomial regression. Sometimes, there may be no best method for predicting future values and now let us create an example for polynomial regression would not be the best method.

Bad fit

The above-discussed things are known as "Bad fit", and we will discuss an example showing the method for bad fit and prove that it is impossible to find the best method for polynomial regression. So, let us discuss an example.

Example

We can prove a bad fit method by taking before discussed examples like arrays, my example, importing numpy and matplotlib and considering the scatter plots.  The values for the x and y-axis must return a bad fit for polynomial regression

import numpy
import matplotlib.pyplot as plt
a = [78, 56, 88, 98, 100, 45, 23, 90, 65, 44]
b = [32, 67, 89, 76, 54, 31, 34, 47, 56, 98]
myexample = numpy.poly1d(numpy.polyfit(a, b, 3))
myline = numpy.linspace(3, 96, 100)
plt.scatter(a, b)
plt.plot(myline, myexample(myline))
plt.show()


Result

Data Distribution in Python

From the above graph, we observed a bad fit line across the data points and hence observed that the regression produced is a bad fit. We can also calculate an r-squared value for the above program.

import numpy
from sklearn.metrics import r2_score
a = [78, 56, 88, 98, 100, 45, 23, 90, 65, 44]
b = [32, 67, 89, 76, 54, 31, 34, 47, 56, 98]
myexample = numpy.poly1d(numpy.polyfit(a, b, 6))
print(r2_score(b, myexample(a)))

Result

Data Distribution in Python

The result is 0.3145, indicating a bad relationship and that the data set is unsuitable for polynomial regression.

There is another type of regression called multiple regression.

Multiple regression

Multiple regression is the same as linear regression; the only difference is that it has more than one independent value; we should try using two or more variables to predict a value.

To understand better, we take an example as a data set, which contains information about car models, weight, volume, and CO2.

So, let us randomly select the column CO2 emission of a car, which will be based on the engine's size; although it has multiple regression, we cannot estimate the correct value, and we can throw more variables for this. So, consider the car's weight to make the prediction better and more accurate.

So, how does it work?

We have modules in a python programming language that will work for the above example. So, for this, we need to importthe Pandas module.

import pandas

Pandas:These are python libraries and are used to initialize data. These are also used for working with data sets. It consists of functions for analyzing, cleaning, exploring, and manipulating data.

The name "pandas" comes from both "panel and data".  Pandas allow us to analyze big data and make conclusions based on statistical theories.

It allows us to read CSV files and return a data frame object.

To read the values of cars from the table, we use,

df = pandas.read_csv("cars.csv")

Let us list the independent values, call this variable "X", and keep all the dependent values in a variable "y"

X = df[['weight', 'volume']]
y = df['CO2’]

Note: It is easy to name the entire list of independent values with an upper-case X and a list of dependent values with a lower-case y.

We should import the sklearn module because we had used the same methods from sklearn.

from sklearn import linear_model

Sklearn module uses LinearRegression() method for creating linear regression objects.

The linear regression object has a method called fit() which takes the independent and dependent values as parameters and fills the regression object with some data, and it describes the relationship:

regr = linear_model.LinearRegression()
regr.fit(X, y)

So, by this relationship, we have an object called regression which predicts CO2 values based on the car's weight and volume.

For predicting the CO2 emission of a car and the weight of the car is 2300kg, and its volume is 1300cm^3:

predictedCO2 = regr.predict([[2300,1300]])

Example

import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
X = df[['Weight', 'Volume']]
y = df['CO2']
regr = linear_model.LinearRegression()
regr.fit(X, y)
#For predicting the emission of Co2 for a car where the weight is 2300kg, and the volume is 1300cm3:
predictedCO2 = regr.predict([[2300, 1300]])
print(predictedCO2)

Result

[107.2087328]

The result explains that when we take a car with a 1.3-litre engine and a weight of 2300kg, it will release 107.208 grams of CO2 for driving one kilometer.

Coefficient

A coefficient is a number or constant present before a variable in an algebraic expression. It describes the relationship with an unknown number.

For example, take a variable "x" then "2x" means double of "x". Here, "x" is an unknown variable, and the number "2" is the coefficient of "x".

We all had learned molecules in chemistry. Now, we take an example using this case. The coefficient of the CO2 molecule is "1"; there is nothing before the term. The weight and volume against CO2 tell us what will happen when we increase or decrease one of the independent values.

Example

import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
x = df[['weight', 'volume']]
y = df['CO2']
regr = linear_model.linearRegression()
regr.fit(X ,y)
print(regr.coef_)


Result

[0.00755 0.00780]

Explanation of result

The above result of the coefficient code represents the coefficient values of CO2 weight and value.

Weight: 0.00755095

Volume: 0.00780526

These weight and volume values describe that if the weight increase by 11kg, then the CO2 emission increases by 0.00755g.

If the size of the engine or volume is increased by 1cm^3, the emission of CO2 will increase by 0.00780g.

We have already mentioned that if a car with a 1300cm^3 engine weighs 2300g, the CO2 emission will be approximately 107g.

Let us discuss another example but change the values from 2300 to 3300. Then, the program will be as below:

import pandas
from sklearn import linear_model
df = pandas.read_csv("cars.csv")
x = df[['weight', 'volume']]
y = df['CO2']
regr = linear_model.linearRegression()
regr.fit(X ,y)
predictedCO2 = regr.predict([[330, 1300]])
print(predictedCO2)


Result

114.75968007

From the above result, we had given a car with a 1.3-litre engine that weighed 3300kg, and it releases 115grams of CO2 approximately for each kilometre the car drives.

The above value needs some calculation to process the program's backside.

Calculation:

17.2087328 + (1000*0.00755095) =114.75968

Binomial distribution

It is a part of the data distribution model that deals with the probability of winning an event which has two possible outcomes in many experiments.

For example, when we toss a coin, it always gives a head or a tail.

Let us move further on this topic. When the probability of finding exactly 3 heads for tossing a coin, immediately 10 times is examined during the binomial distribution.

For the binomial distribution, we use a library called "seaborn", which has many in-built functions, to create the graphs of such probability distribution.

Here, we also use a package called "scipy" that helps us for creating the binomial distribution.

from scipy.stats import binom
import seaborn as sb
binom.rvs(size=10,n=20,p=0.8)
data_binom = binom.rvs(n=20,p=0.8,loc=0,size=1000)
ax = sb.distplot(data_binom, kde=True, color='blue', hist_kws={"linewidth": 25,'alpha':1})
ax.set(xlabel='Binomial', ylabel='Frequency')

Output is as follows:

Data Distribution in Python

Correlation in python

The correlation analysis deals with the association between two or more variables. The degree of relationship between the variables under consideration is measured through correlation analysis. So, the correlation measure is called "correlation coefficient" or "correlation index".

A simple example is a correlation between parents and their offspring and the product price and its supplied quantity.

Chi-square test in python

The chi-square test in python is a statistical method that determines if there is any correlation between two categorical variables.

Both these categorical variables should belong to the same population, and they should be a part like Yes or No, Male or female, Red or green etc.

Let us consider a simple and general example. We can build a dataset with observations like people's tastes while buying ice cream, and there is a chance to correlate the gender of a person with the ice cream flavour they prefer.

If we find a correlation, we can plan for appropriate stock of flavours by knowing the total count of genders who visited.

In the python program, we can use various functions and here, we use a library called "numpy" to perform the chi-square test.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
fig,ax = plt.subplots(1,1)
linestyles = [':', '--', '-.', '-']
deg_of_freedom = [1, 4, 7, 6]
for df, ls in zip(deg_of_freedom, linestyles):
ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls)
plt.xlim(0, 10)
plt.ylim(0, 0.4)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Chi-Square Distribution')
plt.legend()
plt.show()

Result

Data Distribution in Python

Evaluating our model

To evaluate whether our model is good enough or not, we will use a method called Train/Test.

What is Train/Test?

It is a method for calculating the accuracy of our model.

The name suggests Train/Test because we will split the data set into two sets: a training set and a testing set.

Train model is used to create the model and Test model test the accuracy of that model.

Applications of data distribution

  1. One of the best and most used applications is the data distribution method will organize the raw data into a graphical method like histograms, box plots, run charts etc. and will provide useful information.
  2. The main application is that it estimates the probability of any specific observation in a sample space.



ADVERTISEMENT
ADVERTISEMENT