Statistics Tutorial for Data Science

Statistics is the science of collecting, organizing, presenting, analyzing, and interpreting data to support more effective decisions.

Before starting with this tutorial, you should have a basic understanding of Mathematics.

Why is Statistics Important?

Statistics is important because we live in an information-driven world, and much of that information is derived mathematically with the help of statistics.

Example:

  • Researchers use statistical concepts and skills to collect appropriate data.
  • In finance, statistics is key to how traders and business people invest their money.
  • Statistics is used for quality testing. Companies make many products and want to sell only the best quality, but they cannot test every item, so they test a sample and use statistics to judge the rest.
  • Statistics allows us to gather information from around the world.
  • Statistics plays a vital role in medical science. Before any medicine is prescribed, scientists must show a statistically valid rate of effectiveness.

Types of Statistics

There are two types of statistics:

  1. Descriptive Statistics
  2. Inferential Statistics

Descriptive Statistics: Descriptive statistics is a method of organizing, summarizing, and presenting the data in an informative way.

Inferential Statistics: Inferential statistics is a method used in determining something about a population based on a sample.

  • Population: The collection of all individuals or items under consideration in a statistical study. It can be finite or infinite.

Example:

Students of javatpoint.

All stars in the universe.

  • Sample: A subset of the population.

Example:

Students who study Java at javatpoint.

1200 voters selected at random for an interview.

Introduction to Basic Terms:

  • Variable: A characteristic or condition that can take different values for different individual elements of a population.

There are two types of variables in statistics:

  • Qualitative Variable: Qualitative -> Quality

It deals with descriptions, that is, data that can be observed but not measured.

Example:

Colors, smells, etc.  

  • Quantitative Variable: Quantitative -> Quantity

It deals with numbers, that is, data that can be measured.

Example:

Length, Area, etc.

  • Data (singular, strictly "datum"): A single value, which may be a number, a word, or a symbol.
  • Data (plural): The measurements and observations of a variable.
  • Statistic: A numerical value that summarizes the sample data.

Arithmetic Mean: The arithmetic mean is calculated by adding up all the values and dividing the sum by the total number of values.

The arithmetic mean is represented by x̄, read as "x bar".

There are three types of series to calculate the arithmetic mean:

  • Individual Data Series
  • Discrete Data Series
  • Continuous Data Series

Individual Data Series: In this series, data is given on an individual basis.

x̄ = (x1 + x2 + x3 + ... + xn) / N

Where

x̄ = the mean.

x1 = the first value.

x2 = the second value.

x3 = the third value.

xn = the last value.

N = number of values.

Example

Problem Statement:

Calculate the arithmetic mean of the following individual data series:

Items    10    25    15    30    20    50

Solution:

From the given data, we have:

10, 25, 15, 30, 20, 50

x̄ = (10 + 25 + 15 + 30 + 20 + 50) / 6

= 150 / 6

= 25

The arithmetic mean of the given numbers is 25.
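The same calculation can be sketched in Python. This is only an illustration using the values from the example; the variable names are made up for this sketch, and `statistics.mean` from the standard library is shown alongside the manual formula.

```python
from statistics import mean

items = [10, 25, 15, 30, 20, 50]

# Manual formula: sum of the values divided by the number of values (N).
manual_mean = sum(items) / len(items)

# The standard library computes the same quantity.
library_mean = mean(items)

print(manual_mean, library_mean)
```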

Discrete Data Series: In this series, data is given along with their frequencies.

x̄ = (ƒ1x1 + ƒ2x2 + ƒ3x3 + ... + ƒnxn) / N

We can write the same formula as follows:

x̄ = Σƒx / N

Where

N = Number of observations (the sum of the frequencies, Σƒ).

ƒ1, ƒ2, ƒ3, ..., ƒn = Different values of frequency.

x1, x2, x3, ..., xn = Different values of the variable.

Example

Problem Statement:

Calculate the arithmetic mean of the following discrete data:

Items (x)        10   12   16   20   14   15
Frequency (ƒ)     3    5    1    0    2    6

Solution: From the given data, we have:

Items (x)   Frequency (ƒ)   ƒx
10          3               30
12          5               60
16          1               16
20          0               0
14          2               28
15          6               90
            N = 17          Σƒx = 224

x̄ = Σƒx / N = 224 / 17

≈ 13.18

The arithmetic mean of the given numbers is approximately 13.18.
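As a rough sketch, the frequency-weighted mean of a discrete series can be computed by pairing each item with its frequency. The variable names below are illustrative only; the data is taken from the example.

```python
items = [10, 12, 16, 20, 14, 15]
frequencies = [3, 5, 1, 0, 2, 6]

# N is the total number of observations, i.e. the sum of the frequencies.
n = sum(frequencies)

# Sum of (frequency * item) over all rows of the table.
sum_fx = sum(f * x for f, x in zip(frequencies, items))

discrete_mean = sum_fx / n
print(n, sum_fx, round(discrete_mean, 2))
```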

Continuous Data Series: In this series, the data is given based on ranges along with their frequencies.

In the case of a continuous series, the mid-point (m) of each class is calculated as:

m = (lower limit + upper limit) / 2

The arithmetic mean of a continuous data series can be calculated using the formula:

x̄ = (ƒ1m1 + ƒ2m2 + ƒ3m3 + ... + ƒnmn) / N

We can write the same formula as follows:

x̄ = Σƒm / N

Where

N = Number of observations (the sum of the frequencies, Σƒ).

ƒ1, ƒ2, ƒ3, ..., ƒn = Different values of frequency.

m1, m2, m3, ..., mn = Different values of mid-points.

Example

Problem Statement:

Calculate the arithmetic mean of the following continuous data:

Items        0 - 10   10 - 20   20 - 30   30 - 40   40 - 50
Frequency    2        4         1         3         5

Solution:

From the given data, we have:

Items      Mid-point (m)   Frequency (ƒ)   ƒm
0 - 10     5               2               10
10 - 20    15              4               60
20 - 30    25              1               25
30 - 40    35              3               105
40 - 50    45              5               225
                           Σƒ = 15         Σƒm = 425

x̄ = Σƒm / Σƒ = 425 / 15

= 28.33

The arithmetic mean of the given numbers is 28.33.
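For a continuous series the only extra step is replacing each class by its mid-point. A minimal sketch, assuming each class is stored as a hypothetical (lower, upper) tuple:

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [2, 4, 1, 3, 5]

# Mid-point of each class: (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

sum_f = sum(frequencies)
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))

continuous_mean = sum_fm / sum_f
print(midpoints, sum_fm, round(continuous_mean, 2))
```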

Arithmetic Median

The arithmetic median is the middle value in a given set of values. To find the median, arrange the numbers in ascending order and then choose the value in the middle.

The arithmetic median is represented by "M."

There are three types of series to calculate the arithmetic median:

  • Individual Data Series
  • Discrete Data Series
  • Continuous Data Series

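Individual Data Series: In this series, data is given on an individual basis.

Median for a series with an even number of items:

M = Value of the ((N + 1) / 2)th item, that is, the average of the two middle values.

Where

N = Number of observations.

Median for a series with an odd number of items:

M = Value of the middle item, that is, the ((N + 1) / 2)th item.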

Example 1:

Problem Statement:

Calculate the arithmetic median of the following individual data:

Items    12    18    20    24    30    32

Solution:

M = Value of the ((6 + 1) / 2)th number

= Value of the 3.5th number

= (3rd number + 4th number) / 2 = (20 + 24) / 2 = 22

The arithmetic median of the given numbers is 22.

Example 2:

Problem Statement:

Calculate the arithmetic median of the following individual data:

Items    12    18    20    24    32    40    48

Solution:

The given series contains 7 numbers, an odd count, so the middle (4th) number in the series is the arithmetic median, i.e., 24.
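Both cases can be handled by one small function. The sketch below is illustrative (the function name is made up); the standard library's `statistics.median` follows the same rule.

```python
def median_individual(values):
    """Median of an individual series: middle value, or mean of the two middle values."""
    ordered = sorted(values)          # the median is defined on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the single middle item
    # even count: average of the two middle items
    return (ordered[mid - 1] + ordered[mid]) / 2


even_result = median_individual([12, 18, 20, 24, 30, 32])
odd_result = median_individual([12, 18, 20, 24, 32, 40, 48])
print(even_result, odd_result)
```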

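Discrete Data Series: In this series, data is given along with their frequencies. The number of observations N is the sum of the frequencies, and the median is the value of the ((N + 1) / 2)th item, located with the help of cumulative frequencies.

Example 1:

Case 1: Odd number of observations

Problem Statement:

Calculate the arithmetic median of the following discrete data:

Items        12   18   20   24   30   32
Frequency     3    5    2    1    4    6

Solution:

Items   Frequency   Cumulative frequency
12      3           3
18      5           8
20      2           10
24      1           11
30      4           15
32      6           21

N = 21

M = Value of the ((21 + 1) / 2)th item

= Value of the 11th item

The 11th item falls in the row with cumulative frequency 11, so its value is 24.

The arithmetic median of the given numbers is 24.

Example 2:

Case 2: Even number of observations

Problem Statement:

Calculate the arithmetic median of the following discrete data:

Items        12   18   20   24   32   40   48
Frequency     5    3    2    1    4    9    6

Solution:

Items   Frequency   Cumulative frequency
12      5           5
18      3           8
20      2           10
24      1           11
32      4           15
40      9           24
48      6           30

N = 30

M = Value of the ((30 + 1) / 2)th item

= Value of the 15.5th item

= (15th item + 16th item) / 2 = (32 + 40) / 2 = 36

The arithmetic median of the given numbers is 36.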
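A discrete series can be read as the expanded list in which each item is repeated according to its frequency, so the median position depends on the total frequency N, not on the number of distinct items. The sketch below, with hypothetical helper names, locates the ((N + 1) / 2)th item through cumulative frequencies.

```python
from itertools import accumulate


def item_at(position, items, frequencies):
    """Value at a 1-based position of the sorted, frequency-expanded series."""
    pairs = sorted(zip(items, frequencies))
    cumulative = accumulate(f for _, f in pairs)
    for (value, _), cf in zip(pairs, cumulative):
        if position <= cf:            # first row whose cumulative frequency reaches the position
            return value
    raise ValueError("position exceeds the total frequency")


def median_discrete(items, frequencies):
    n = sum(frequencies)
    if n % 2 == 1:
        return item_at((n + 1) // 2, items, frequencies)
    # even N: average of the (N/2)th and (N/2 + 1)th items
    return (item_at(n // 2, items, frequencies)
            + item_at(n // 2 + 1, items, frequencies)) / 2


first = median_discrete([12, 18, 20, 24, 30, 32], [3, 5, 2, 1, 4, 6])
second = median_discrete([12, 18, 20, 24, 32, 40, 48], [5, 3, 2, 1, 4, 9, 6])
print(first, second)
```

With the two frequency tables from the examples, N is 21 and 30, and the function returns 24 and 36 respectively.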

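Continuous Data Series: In this series, the data is given based on ranges along with their frequencies.

The arithmetic median of a continuous data series can be calculated using the formula:

M = L + ((n / 2 - c.ƒ) / ƒ) × i

Where

L = lower limit of the median class.

c.ƒ = cumulative frequency of the class preceding the median class.

ƒ = frequency of the median class.

i = class interval of the median class.

n = total number of observations.

Median class: It is the first class whose cumulative frequency is at least n / 2.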

Example

Problem Statement:

Calculate the arithmetic median of the following continuous data:

Items        0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency    7      18      34      50      35      20      6

Solution:

Items    Frequency   Cumulative frequency
0-10     7           7
10-20    18          25
20-30    34          59
30-40    50          109
40-50    35          144
50-60    20          164
60-70    6           170
         Σƒ = 170

n / 2 = 170 / 2 = 85, so the median class is 30-40, the first class whose cumulative frequency (109) reaches 85.

L = 30

c.f. = 59

f = 50

i = 10

M = 30 + ((85 - 59) / 50) × 10

= 30 + 5.2

= 35.2

The Arithmetic Median of given numbers is 35.2.
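The interpolation can be sketched as a single pass over the classes. The function name and the (lower, upper) tuple layout are assumptions made for this illustration; the median class is taken to be the first class whose cumulative frequency reaches n / 2.

```python
def median_continuous(classes, frequencies):
    """Median of a grouped series: L + ((n/2 - c.f.) / f) * i."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("no observations")
    cumulative_before = 0             # c.f. of the classes preceding the current one
    for (lower, upper), f in zip(classes, frequencies):
        if cumulative_before + f >= n / 2:
            width = upper - lower     # class interval i
            return lower + (n / 2 - cumulative_before) / f * width
        cumulative_before += f
    raise ValueError("median class not found")


classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [7, 18, 34, 50, 35, 20, 6]
grouped_median = median_continuous(classes, frequencies)
print(round(grouped_median, 2))
```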

Arithmetic Mode

The arithmetic mode is the value that occurs most frequently.

The arithmetic mode is represented by "M0".

There are three types of series to calculate arithmetic mode:

  • Individual Data Series
  • Discrete Data Series
  • Continuous Data Series

Individual Data Series: In this series, data is given on an individual basis.

Example

Problem Statement:

Calculate the arithmetic mode of the following individual data:

Items    12    28    25    20    128    45    20

Solution:

The arithmetic mode of the given numbers is 20, as it is repeated the maximum number of times, i.e., 2.

Discrete Data Series: In this series, data is given along with their frequencies. In a discrete series, the arithmetic mode is calculated by finding the value of the variable that has the highest frequency.

Example:

Problem Statement:

Calculate the arithmetic mode of the following discrete data:

Items        15   120   117   78   95   25
Frequency    12   15    10    9    14   11

Solution:

The arithmetic mode of the given numbers is 120, as it has the highest frequency, 15.
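Both mode calculations amount to finding the largest count. A short sketch with the data from the two examples; `collections.Counter` is from the standard library, while the variable names are illustrative.

```python
from collections import Counter

# Individual series: count how often each value occurs.
individual = [12, 28, 25, 20, 128, 45, 20]
individual_mode, occurrences = Counter(individual).most_common(1)[0]

# Discrete series: the frequencies are already given, so pick the
# item whose frequency is the largest.
items = [15, 120, 117, 78, 95, 25]
frequencies = [12, 15, 10, 9, 14, 11]
discrete_mode = items[frequencies.index(max(frequencies))]

print(individual_mode, occurrences, discrete_mode)
```

If several values tie for the highest count, this sketch reports only the first one it encounters.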

Continuous Data Series: In this series, the data is given based on ranges along with their frequencies.

The arithmetic mode of a continuous data series can be calculated using the formula:

M0 = L + ((ƒ1 - ƒ0) / (2ƒ1 - ƒ0 - ƒ2)) × i

Where

M0 = Arithmetic mode.

L = Lower limit of the modal class.

ƒ0 = Frequency of the class preceding the modal class.

ƒ1 = Frequency of the modal class.

ƒ2 = Frequency of the class succeeding the modal class.

i = Class interval.

Modal class: It is the class that has the highest frequency.

Example

Problem Statement:

Calculate the arithmetic mode of the following continuous data:

Items        0-10   10-20   20-30   30-40   40-50   50-60
Frequency    5      8       6       10      4       1

Solution:

The modal class is 30-40, as it has the highest frequency, 10.

L = 30

ƒ0 = 6

ƒ1 = 10

ƒ2 = 4

i = 10

M0 = 30 + ((10 - 6) / (2 × 10 - 6 - 4)) × 10

= 30 + 4

= 34

The arithmetic mode of the given numbers is 34.
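The grouped-mode formula can be sketched as follows. The function name and the (lower, upper) tuple layout are assumptions for illustration, and the class interval i is taken as the width of the modal class, which is 10 for the class 30-40.

```python
def mode_continuous(classes, frequencies):
    """Mode of a grouped series: L + ((f1 - f0) / (2*f1 - f0 - f2)) * i."""
    k = frequencies.index(max(frequencies))      # index of the modal class
    lower, upper = classes[k]
    f1 = frequencies[k]
    # Neighbouring frequencies; treated as 0 when the modal class is first or last.
    f0 = frequencies[k - 1] if k > 0 else 0
    f2 = frequencies[k + 1] if k < len(frequencies) - 1 else 0
    width = upper - lower                        # class interval i
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * width


classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [5, 8, 6, 10, 4, 1]
grouped_mode = mode_continuous(classes, frequencies)
print(grouped_mode)
```

For this table the calculation is 30 + (4 / 10) × 10 = 34. The sketch does not guard against a zero denominator, which occurs when the modal class and both neighbours have equal frequencies.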

Measures of Dispersion: These describe the spread and variability of the data values.

The commonly used dispersion measures include:

  • Range
  • Variance and standard deviation

Range: The range is the difference between the largest and the smallest observation in the sample.

The following formula gives the arithmetic range:

Range = L - S

Where

L = Largest item

S = Smallest item

Coefficient of Range: The coefficient of range is given by:

Coefficient of Range = (L - S) / (L + S)

Example

Problem Statement:

What are the range and the coefficient of range for the following set of numbers?

     20      40      10       45      15      65       55       90

Solution:

Sort the numbers in ascending order.

     10       15       20       40       45        55       65        90

Largest Number = 90

Smallest Number = 10

Range = 90 - 10

= 80

Largest + Smallest = 90 + 10 = 100

Coefficient of Range = 80 / 100 = 0.8
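In code, sorting is not needed because `max` and `min` find the extremes directly. A small sketch with the numbers from the example (variable names are illustrative):

```python
numbers = [20, 40, 10, 45, 15, 65, 55, 90]

largest = max(numbers)
smallest = min(numbers)

value_range = largest - smallest                            # Range = L - S
coefficient_of_range = value_range / (largest + smallest)   # (L - S) / (L + S)

print(value_range, coefficient_of_range)
```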

Sample variance: It is the sum of squared deviations from the sample mean, divided by n - 1.

Let x1, x2, x3, ..., xn be a random sample from a population. Then the sample variance is:

S² = Σ(xi - x̄)² / (n - 1)

Where

S² = sample variance.

xi = ith element of the sample.

x̄ = mean of the sample.

n = sample size.

Example

Problem Statement:

Find the sample variance of the following numbers:

Numbers    4    10    12    15    9

Solution:

x̄ = (4 + 10 + 12 + 15 + 9) / 5 = 10

xi    (xi - x̄)         (xi - x̄)²
4     (4 - 10) = -6     36
10    (10 - 10) = 0     0
12    (12 - 10) = 2     4
15    (15 - 10) = 5     25
9     (9 - 10) = -1     1
                        Σ(xi - x̄)² = 66

S² = 66 / (5 - 1)

= 16.5

The sample variance is 16.5.

Standard Deviation: It is the square root of the sample variance, denoted by S.

S = √S²

Where

S² = sample variance.

Example:

Problem Statement:

Consider the above problem statement, where the sample variance is 16.5. Calculate the value of the standard deviation.

Solution:

S = √16.5

= 4.06

The standard deviation is 4.06.
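The two results can be checked in Python. The sketch below computes the sample variance by hand with the n - 1 divisor and compares it with `statistics.variance` and `statistics.stdev`, which also use n - 1; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev, variance

numbers = [4, 10, 12, 15, 9]

x_bar = mean(numbers)

# Sum of squared deviations from the mean, divided by (n - 1).
manual_variance = sum((x - x_bar) ** 2 for x in numbers) / (len(numbers) - 1)
manual_std = sqrt(manual_variance)

# Standard-library equivalents for a sample.
library_variance = variance(numbers)
library_std = stdev(numbers)

print(manual_variance, round(manual_std, 2))
```

For a whole population rather than a sample, the divisor is n instead of n - 1; the standard library exposes that as `statistics.pvariance` and `statistics.pstdev`.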

[Figure: Difference between Population and Sample]

Probability: It is the chance of something happening. Its value lies in the range [0, 1].

Event (E): Number of trials in which the event happens.

Sample Space (S): Total number of trials.

P(E) = Number of trials in which the event happens / Total number of trials

Example

Problem Statement:

A coin is tossed 10 times. Heads occur 4 times and tails occur 6 times. What is the probability of getting a head?

Solution:

Number of trials in which the event happens = 4

Total number of trials = 10

P(head) = 4 / 10 = 0.4

Probability - Additive Theorem

Mutually Exclusive Events: Events of which only one can happen at any one time.

Example 1: When a coin is tossed, either a head or a tail occurs, not both.

Mathematically, it is calculated using the formula:

P(A or B) = P(A) + P(B)

P(A ∪ B) = P(A) + P(B)

Example 2

Problem Statement:

What is the probability of drawing an ace or a king from a standard deck of cards?

Solution:

P(ace or king) = P(ace) + P(king)

= 4/52 + 4/52 = 8/52 = 2/13

Non-Mutually Exclusive Events: Events of which more than one can happen at the same time.

Example 1: When a die is rolled and a 6 occurs, the outcome is both even and a multiple of three.

Mathematically, it is calculated using the formula:

P(A or B) = P(A) + P(B) - P(A and B)

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Example 2

Problem Statement:

When drawing a card from a deck of 52 cards, what is the probability of getting a red card or a queen?

Solution:

P(Red or queen) = P(Red) + P(queen) - P(Red ∩ queen)

= 26/52 + 4/52 - 2/52 = 28/52 = 7/13
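Both additive rules can be checked by enumerating a standard 52-card deck and counting favourable cards. This is only a sketch: the deck representation and the helper names are made up for the illustration, and `fractions.Fraction` keeps the results exact.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))          # 52 (rank, suit) pairs


def probability(event):
    """Fraction of cards in the deck for which event(card) is true."""
    favourable = sum(1 for card in deck if event(card))
    return Fraction(favourable, len(deck))


def is_ace(card):
    return card[0] == "A"


def is_king(card):
    return card[0] == "K"


def is_queen(card):
    return card[0] == "Q"


def is_red(card):
    return card[1] in ("hearts", "diamonds")


# Mutually exclusive: a card cannot be both an ace and a king.
p_ace_or_king = probability(lambda c: is_ace(c) or is_king(c))

# Not mutually exclusive: two queens are also red, so the overlap is subtracted.
p_red_or_queen = probability(lambda c: is_red(c) or is_queen(c))
p_by_formula = (probability(is_red) + probability(is_queen)
                - probability(lambda c: is_red(c) and is_queen(c)))

print(p_ace_or_king, p_red_or_queen, p_by_formula)
```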

Probability – Multiplicative Theorem

Independent Events: Two events are said to be independent of each other when the occurrence of one does not affect the probability of the other.

Example 1: Tossing a coin and rolling a number cube are independent events.

Mathematically, it is calculated using the formula:

P(A and B) = P(A) × P(B)

P(A ∩ B) = P(A) × P(B)

Example 2

Problem Statement:

A coin is tossed twice. What is the probability of getting a head and then a tail?

Solution:

P(H1 ∩ T2) = P(H1) × P(T2)

= 1/2 × 1/2 = 1/4
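The product rule for independent events can be verified by listing all equally likely outcomes of two tosses of a fair coin. A brief sketch with illustrative names:

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of two tosses: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# Direct count: head on the first toss, tail on the second.
favourable = [o for o in outcomes if o == ("H", "T")]
p_head_then_tail = Fraction(len(favourable), len(outcomes))

# Product rule: P(H1) * P(T2).
p_h1 = Fraction(sum(1 for o in outcomes if o[0] == "H"), len(outcomes))
p_t2 = Fraction(sum(1 for o in outcomes if o[1] == "T"), len(outcomes))

print(p_head_then_tail, p_h1 * p_t2)
```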

Dependent Events: Two events are said to be dependent on each other when the occurrence of one affects the probability of the other.

Example 1: Taking out a ball from a bag containing some balls and not replacing it, and then taking out a second ball.

Mathematically, it is calculated using the formula:

P(A and B) = P(A) × P(B | A)

Where

P(B | A) = probability of B given that A has already occurred.

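Bayes' Theorem: Bayes' theorem finds the probability of an event based on other things that we already know.

Bayes' theorem is:

P(Ai | B) = P(Ai) × P(B | Ai) / (P(A1) × P(B | A1) + P(A2) × P(B | A2) + ... + P(An) × P(B | An))

Where A1, A2, ..., An are a set of n mutually exclusive and exhaustive events, and B is the event that has been observed.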

Example

Problem Statement:

Bag X contains 4 white and 6 black balls, while Bag Y contains 4 white and 3 black balls. One ball is drawn at random from one of the bags, and it is found to be black. Find the probability that the ball was drawn from Bag X.

Solution: 

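E1 = Event of choosing Bag X.

E2 = Event of choosing Bag Y.

A = Event of drawing a black ball.

P(E1) = P(E2) = 1/2

P(A | E1) = 6/10 = 3/5

P(A | E2) = 3/7

By Bayes' theorem:

P(E1 | A) = P(E1) × P(A | E1) / (P(E1) × P(A | E1) + P(E2) × P(A | E2))

= (1/2 × 3/5) / (1/2 × 3/5 + 1/2 × 3/7)

= (3/10) / (36/70)

= 7/12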
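The bag problem can be worked through with exact fractions. The sketch below assumes, as the problem implies but does not state outright, that each bag is equally likely to be chosen; the dictionary names are hypothetical.

```python
from fractions import Fraction

# Assumption: each bag is chosen with probability 1/2.
prior = {"X": Fraction(1, 2), "Y": Fraction(1, 2)}

# Probability of drawing a black ball from each bag.
p_black_given = {
    "X": Fraction(6, 10),   # 6 black out of 10 balls
    "Y": Fraction(3, 7),    # 3 black out of 7 balls
}

# Total probability of drawing a black ball (denominator of Bayes' theorem).
p_black = sum(prior[bag] * p_black_given[bag] for bag in prior)

# Posterior probability that the ball came from Bag X, given it is black.
posterior_x = prior["X"] * p_black_given["X"] / p_black

print(p_black, posterior_x)
```

Under that assumption the posterior works out to 7/12, slightly above one half, because a black ball is more likely to come from Bag X (3/5) than from Bag Y (3/7).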