Data Analytics MCQ

1. Which of the following options is a type of linear regression?

  1. Simple Linear Regression (S.L.R.)
  2. Multiple Linear Regression (M.L.R.)
  3. Both of the given options (S.L.R. and M.L.R.) are types of linear regression
  4. None of the above

Answer: C. Both of the given options are types of linear regression

Explanation: There are two types of linear regression: Simple Linear Regression (S.L.R.) and Multiple Linear Regression (M.L.R.). A Multiple Linear Regression model has one dependent variable and two or more independent variables, while a Simple Linear Regression model has one dependent variable and exactly one independent variable.

2. Which among the following options is not a frequently used method to measure central tendency in data analytics?

  1. Median
  2. Variance/(S.D. ^2)
  3. Mean/Average
  4. Mode

Answer: B. Variance

Explanation: Mean, median, and mode are the three common measures of central tendency. The mean is the average of the data values, the median is the middle value of the sorted data, and the mode is the value with the highest frequency. Variance is not a measure of central tendency; it is a measure of dispersion, as is the standard deviation.
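As a minimal sketch of this distinction, the following Python snippet uses the standard statistics module on a small made-up list (the values are hypothetical, chosen only for illustration) to compute the three measures of central tendency alongside the variance.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)            # average of the values
median = statistics.median(values)        # middle of the sorted values
mode = statistics.mode(values)            # most frequent value
variance = statistics.pvariance(values)   # dispersion, not central tendency

print(mean, median, mode, variance)
```

The first three results describe where the data is centered, while the variance describes how widely it is spread around that center.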

3. Why is data normalization used in data analytics? Mark the most appropriate reason.

  1. To decrease the redundancy in the data
  2. To take out the data point which is missing
  3. To remove outliers (extreme cases) from the data
  4. To systematize the wide range of values in the data

Answer: D. To systematize the wide range of values in the data

Explanation: Do not confuse this with normalization in database management systems. In data analytics, normalization rescales the values of each variable to a standard range. This makes different variables easier to compare and helps each variable contribute comparably to the analysis.
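One common way to do this is min-max scaling, which maps each variable onto the range 0 to 1. The sketch below is only an illustration under that assumption; the function name min_max_scale and the sample incomes and ages are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical variables measured on very different scales.
incomes = [20000, 50000, 80000]
ages = [20, 35, 50]

scaled_incomes = min_max_scale(incomes)
scaled_ages = min_max_scale(ages)
print(scaled_incomes, scaled_ages)
```

After scaling, both variables lie in the same range, so neither dominates a distance-based or gradient-based analysis simply because of its units.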

4. Which of the following is not a data visualization tool commonly used in data analytics?

  1. Decision tree
  2. Bar chart
  3. Scatter plot
  4. Box plot

Answer: A. Decision tree

Explanation: Bar charts, scatter plots, and box plots are data visualization techniques used in data analytics. The decision tree, however, is not a data visualization technique; it is a popular machine learning (ML) algorithm used for regression and classification. Don’t confuse machine learning algorithms with data visualization tools.

5. Which of the following statistical tests is used to determine whether two groups of data are significantly different from each other?

  1. T-test
  2. Chi Square test
  3. ANOVA (Analysis of Variance)
  4. Pearson correlation

Answer: A. T-test

Explanation: The t-test is used to determine whether two groups of data are significantly different from each other. The chi-square test is used for categorical data, while ANOVA is used for three or more groups. Pearson correlation measures the strength of the linear relationship between two continuous variables. Hence option A, T-test, is the correct option.
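To make the idea concrete, here is a minimal sketch of the pooled two-sample Student's t statistic using only the Python standard library. It assumes equal variances in the two groups; the function name and the sample measurements are hypothetical.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Two-sample Student's t statistic, assuming equal group variances."""
    na, nb = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / std_err

# Hypothetical measurements for two groups.
group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]

t_stat = pooled_t_statistic(group_a, group_b)
degrees_of_freedom = len(group_a) + len(group_b) - 2
print(t_stat, degrees_of_freedom)
```

The statistic is then compared against a t distribution with the given degrees of freedom to obtain a p-value; in practice a statistics library would normally handle that step.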

6. Which of the following algorithms is used commonly for classifying data in machine learning?

  1. Linear Regression
  2. Decision tree
  3. Principal component analysis
  4. K means

Answer: B. Decision tree

Explanation: Of the given options, the decision tree is the one commonly used for classification in machine learning. Linear regression is used for regression analysis, principal component analysis is used to reduce the dimensions of a data set, and K-means is used for clustering. Principal component analysis and K-means are unsupervised learning techniques, whereas linear regression and decision trees are supervised. Hence option B, decision tree, is the correct answer.

7. In the given options, which of the algorithms is/are supervised learning algorithms used for classification?

  1. Principal component analysis (PCA)
  2. K means
  3. Random forest
  4. Naïve Bayes

Answer: D. Naïve Bayes

Explanation: Naïve Bayes is a supervised learning algorithm that classifies data using the maximum a posteriori (MAP) criterion, that is, it assigns each observation to the class with the highest posterior probability. PCA (principal component analysis) and K-means are unsupervised learning techniques. Note that random forest is also a supervised learning algorithm, used for both regression and classification, so option C is technically valid as well; Naïve Bayes (option D) is the answer intended here.

8. Which of the following measurement scales has a true zero?

  1. Ordinal measurement scale
  2. Nominal measurement scale
  3. Ratio measurement scale
  4. Interval measurement scale

Answer: C. Ratio measurement scale

Explanation: In data analytics, there are four scales of measurement: nominal, ordinal, interval, and ratio. Nominal scales are used for categorical data with no order; ordinal scales are used for data with a natural order; interval scales have equal intervals between values but no true zero; and ratio scales have equal intervals and a true zero. The ratio scale is the only one of the four with a true zero point. Hence option C, ratio measurement scale, is the correct answer.

9. What is the purpose behind using data sampling in the field of data analytics?

  1. To increase the size of the dataset for a more accurate analysis of the data
  2. Used for the pre-processing and cleaning of the data before making something meaningful out of it.
  3. To obtain a small, representative sample of data from a larger set of data for analysis purposes.
  4. To remove the outliers occurring in the set of data.

Answer: C. To obtain a small, representative sample of data from a larger dataset for analysis purposes.

Explanation: In data analytics, data sampling means selecting a smaller, representative subset of a larger dataset for analysis. Working with a smaller subset saves time and resources while preserving the main characteristics of the full data. Hence option C is the most suitable answer to this question.

10. Which statistical measure is used to define the central tendency of a set of data?

  1. Standard deviation (S.D.)
  2. Variance (Square of S.D.)
  3. Mean (Average)
  4. Range

Answer: C. Mean (Average)

Explanation: The correct option is the mean. Of the given options, the mean is the only statistical measure of central tendency. It is the sum of the data values divided by the number of values. Variance, range, and standard deviation are all measures of dispersion. The standard deviation, for example, measures how far the data is spread around the mean. Hence option C, mean, is the correct answer.

11. Which of the following data visualization techniques is used to display the distribution of a dataset?

  1. Bar plot
  2. Scatter plot
  3. Box plot
  4. Pie chart

Answer: C. Box plot

Explanation: The correct option is the box plot. A box plot is a data visualization technique used to display the distribution of a set of data. It shows the median, the quartiles, and any outliers, which makes it useful for identifying skewness or outliers in the data.

12. Tata Motors wants to examine its sales data to figure out trends and patterns in sales. They have collected data on the number of products sold, the price of each product, and the day of the week the product was sold. Which of the statistical methods would be most appropriate for this analysis?

  1. T-test
  2. Regression analysis
  3. Factor analysis
  4. Cluster analysis

Answer: B. Regression analysis

Explanation: The correct option is regression analysis. To find trends and patterns in sales data, regression analysis is the most appropriate of the listed methods. It can quantify the relationship between the independent variables (the price of each product and the day of the week on which it is sold) and the dependent variable (the number of products sold).

13. Hyundai Motors wants to predict its sales for the upcoming quarter based on the previous quarter’s data. The linear regression equation for the data is y = 10 + 3 * x, where x is the time period, and y is the sales. If Hyundai Motors wants to predict its sales for the next time period (x = 4), what is the predicted value of sales?

  1. 24
  2. 17
  3. 22
  4. 15

Answer: C. 22

Explanation: The correct answer is 22. To predict the sales for the next time period (x = 4), substitute x = 4 into the regression equation: y = 10 + 3 * (4) = 22. Therefore the predicted value of the sales is 22.
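The substitution can be checked in a couple of lines of Python. The function name predict_sales is an illustrative choice, not something taken from the question.

```python
def predict_sales(x, intercept=10, slope=3):
    """Evaluate the fitted line y = intercept + slope * x."""
    return intercept + slope * x

predicted = predict_sales(4)
print(predicted)
```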

14. The standard deviation of a given set of numbers is 10, and the mean/average of the same data is 50. What is the value of z (z = (x – mean) / S. D.) for a data point of 60?

  1. 1.6
  2. 1.0
  3. 2.0
  4. 2.2

Answer: B. 1.0

Explanation: The value of z can be calculated with the help of the formula given as - z = (x – u) / s, where u = 50 is the mean of the given data, and s = 10 is the standard deviation of the data. Putting the values u = 50 and s = 10 in the equation z = (x – u) / s, we will get z = (60 – 50) / 10 = 1.0. Hence option B is the correct answer.
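The same calculation, written as a small Python function for checking other data points; the name z_score is an assumption made for this sketch.

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above the mean."""
    return (x - mean) / sd

z = z_score(60, mean=50, sd=10)
print(z)
```

A data point below the mean gives a negative z, for example z_score(40, 50, 10) is -1.0.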

15. Tesla Motors analyzed its website traffic and found that the number of unique visitors each day follows a Poisson distribution with a mean of 50. What is the probability that there will be at least 40 unique visitors on a particular day?

  1. 0.963
  2. 0.183
  3. 0.235
  4. 0.65

Answer: B. 0.183

Explanation: The Poisson probability mass function is P(X = k) = (e ^ -λ * λ ^ k) / k!, for k = 0, 1, 2, 3, 4, 5…, where λ is the mean of the distribution (here λ = 50). The question asks for at least 40 visitors, so P(X >= 40) = 1 - P(X <= 39) = 1 - Σ (k = 0 to 39) (e ^ -50 * 50 ^ k) / k!. This sum is impractical to evaluate by hand and is normally computed with a computer program. Note that evaluating it gives roughly 0.935, which matches none of the listed options, and the "at most 40" probability P(X <= 40) is roughly 0.086, so the keyed value 0.183 appears to be incorrect.
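Since the sum has to be evaluated by a program, here is a minimal sketch using only the Python standard library. It computes both the "at most 40" and the "at least 40" probabilities so that either reading of the question can be checked directly; the function names are assumptions made for this example.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    # Work in log space to avoid overflow in lam ** k and k!.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    """P(X <= k), summed term by term."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 50
p_at_most_40 = poisson_cdf(40, lam)        # P(X <= 40)
p_at_least_40 = 1 - poisson_cdf(39, lam)   # P(X >= 40)

print(round(p_at_most_40, 4), round(p_at_least_40, 4))
```

The log-space form is a design choice for numerical safety; for a mean of 50 the direct formula also works in double precision, but it breaks down for larger means.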

16. If a given set of numbers has a skewness of about 1.8, then what does this tell about the set of data?

  1. The given set of data is positively skewed
  2. The given set of data is negatively skewed
  3. The dataset given is normally distributed
  4. The given set of data is symmetrically distributed

Answer: A. The given set of data is positively skewed

Explanation: The correct answer is positively skewed. Skewness measures how asymmetric a distribution is about its mean. A positively skewed distribution has a longer tail on the positive (right) side, and a negatively skewed distribution has a longer tail on the negative (left) side. A symmetric distribution, such as the normal distribution, has a skewness of zero. Because the skewness here is positive (1.8), the data is positively skewed.
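The sign convention can be illustrated with a short sketch that computes the moment-based skewness (third central moment divided by the second central moment raised to the power 1.5). This is one of several skewness definitions, and the sample lists are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

right_tailed = [1, 2, 2, 3, 10]           # hypothetical data, long right tail
left_tailed = [-v for v in right_tailed]  # mirror image, long left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_tailed), ("left", left_tailed), ("symmetric", symmetric)]:
    print(name, skewness(data))
```

The right-tailed list gives a positive value, its mirror image gives the same magnitude with a negative sign, and the symmetric list gives zero.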

17. The correlation between two variables can be expressed in terms of which of the following measures?

  1. Mean / Average difference
  2. Covariance
  3. Standard deviation (S.D.)
  4. Variance (S. D. ^ 2)

Answer: B. Covariance

Explanation: The correct answer is covariance. The correlation between two variables is their covariance divided by the product of their standard deviations. The covariance itself is the sum of the products of each pair's deviations from the respective means, divided by n - 1 for a sample (or n for a population).
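A minimal sketch of the sample covariance calculation, using the n - 1 denominator. The function name and the paired values are hypothetical.

```python
def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical paired observations.
xs = [1, 2, 3]
ys = [2, 4, 9]

cov_xy = sample_covariance(xs, ys)
print(cov_xy)
```

A positive result means the two variables tend to move in the same direction; dividing it by the product of the two standard deviations rescales it to the range -1 to 1.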

18. Suppose a set of data has a standard deviation (S.D.) of 4 and a mean of 20. What is the coefficient of variation?

  1. 0.8
  2. 0.2
  3. 0.25
  4. 0.05

Answer: B. 0.2

Explanation: The coefficient of variation (C.V.) can be calculated by dividing the standard deviation by the mean. It measures the relative variations of data. Here, C.V. = S.D. / mean = 4 / 20 = 0.2. In percentage, it is 0.2 * 100 = 20 %. But here, only the decimal equivalent is being asked, which is equal to 0.2. Hence option B 0.2 is the correct answer to this question.
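For reference, the same ratio as a small Python helper; the function name is an illustrative assumption.

```python
def coefficient_of_variation(sd, mean):
    """Relative variability: standard deviation divided by the mean."""
    return sd / mean

cv = coefficient_of_variation(sd=4, mean=20)
cv_percent = cv * 100
print(cv, cv_percent)
```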

19. Tata Motors Private Limited is conducting A/B testing to compare two different website layouts. The company randomly assigns 1000 visitors to each of the two layouts and measures the number of clicks on a particular button. Which of the given tests should be used to know if there is a significant difference in click-through rates?

  1. Chi square test
  2. Z test
  3. ANOVA
  4. Student’s t test

Answer: D. Student’s t test

Explanation: Student’s t test is used to compare the means of two different groups. Here the click-through rate of each layout is the mean of a 0/1 click indicator over 1000 visitors, so comparing the two rates is a comparison of two means. Strictly speaking, clicks are binary rather than continuous, and a two-proportion z test or a chi-square test is the more conventional choice; with samples this large, the t test gives almost the same result. Option D is the answer intended here.

20. Suppose there are two variables having a covariance equal to 10, and the standard deviation of the two different variables is 2 and 5, respectively. What is the value of the correlation coefficient between these two variables?

  1. 2.5
  2. 0.2
  3. 1.0
  4. 5

Answer: C. 1.0

Explanation: The correlation coefficient is the covariance divided by the product of the standard deviations of the two variables. It measures the strength of the linear relationship between them. Here the covariance is 10, the standard deviation of the first variable is 2, and the standard deviation of the second variable is 5. Therefore the correlation coefficient = 10 / (2 * 5) = 1.0.
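The ratio can be written as a short Python function. The name correlation_from_covariance is hypothetical, and the range check is an added safeguard for this sketch rather than part of the original calculation.

```python
def correlation_from_covariance(cov, sd_x, sd_y):
    """Correlation coefficient: covariance / (sd_x * sd_y)."""
    r = cov / (sd_x * sd_y)
    # A valid correlation lies in [-1, 1]; anything else means inconsistent inputs.
    if not -1.0 <= r <= 1.0:
        raise ValueError("covariance is inconsistent with the standard deviations")
    return r

r = correlation_from_covariance(cov=10, sd_x=2, sd_y=5)
print(r)
```

The check is useful with quiz-style inputs: a covariance larger in magnitude than the product of the two standard deviations cannot occur for real data.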

21. A set of data has a mean equal to 50 and a standard deviation (S.D.) of 10. What is the ninety-five percent (95%) confidence interval for the mean of the set of data?

  1. (40, 60)
  2. (47.5, 52.5)
  3. (45, 55)
  4. (48.5, 51.5)

Answer: B. (47.5, 52.5)

Explanation: The confidence interval for the mean is given by: Confidence interval = mean +- (z value) * (standard deviation) / sqrt (sample size). For a 95% confidence interval, the value of z is 1.96, so Confidence interval = 50 +- (1.96) * (10) / sqrt (n). The sample size n is not given in the question, so it has to be assumed. The interval (47.5, 52.5) has a margin of error of 2.5, which corresponds to sqrt (n) = (1.96 * 10) / 2.5 = 7.84, that is, a sample of about 61 observations. Under that assumption, the confidence interval is approximately (47.5, 52.5).
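Because the sample size is not stated in the question, the sketch below treats it as a parameter. The value n = 61 is an assumption, chosen only because it approximately reproduces the keyed interval; the function name is also illustrative.

```python
import math

def confidence_interval(mean, sd, n, z=1.96):
    """z-based confidence interval for the mean: mean +- z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# n = 61 is an assumed sample size, not given in the question.
low, high = confidence_interval(mean=50, sd=10, n=61)
print(round(low, 2), round(high, 2))
```

Trying other values of n shows how strongly the width of the interval depends on the sample size: quadrupling n halves the margin of error.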

22. Which among the following machine learning algorithms is used for unsupervised learning?

  1. Decision tree
  2. Log transformation
  3. Min max scaling
  4. K-means clustering

Answer: D. K-means clustering

Explanation: K-means clustering is used in machine learning as unsupervised learning, which means that it can be used to find relations and links in data without training on a labeled set of data. The decision tree is used as a supervised learning algorithm, while Log transformation and min max scaling are not machine learning algorithms. Hence, option D is the correct answer.

23. Which among the following pattern recognition algorithms is used for dimensionality reduction of the given dataset?

  1. Decision tree
  2. Principal Component Analysis (P.C.A.)
  3. Min max scaling
  4. K-means clustering

Answer: B. Principal Component Analysis (P.C.A.)

Explanation: Dimensionality reduction means reducing the number of features in the dataset. P.C.A. (Principal Component Analysis) transforms high-dimensional data into lower-dimensional data while retaining as much of the information (variance) as possible. This technique is mostly used for feature extraction and for visualizing data. Hence, option B, Principal Component Analysis (P.C.A.), is the correct answer.

24. For the linear regression equation y = 2 * x + 1, what is the coefficient of determination (R-squared) based on the following data points: (1, 3), (2, 5), (3, 7), (4, 9), (5, 11)?

  1. 0.8
  2. 1
  3. 0.45
  4. 0.3

Answer: B. 1

Explanation: The coefficient of determination (R-squared) measures how well the linear regression model fits the given data. Its value ranges from 0 to 1, where 0 means the model does not explain the data at all and 1 means the model fits the data perfectly.

To calculate R-squared, first compute the predicted values of Y from the model Y = 2 * x + 1. Next compute SSE (the sum of squared errors), which is the sum of the squared differences between the actual and predicted values of Y. Finally, R-squared = 1 - (SSE / SST), where SST is the sum of the squared differences between the actual values of Y and their mean.

The predicted values are Y = [3, 5, 7, 9, 11] and the mean of Y is 7, so SSE = (3 - 3) ^ 2 + (5 - 5) ^ 2 + (7 - 7) ^ 2 + (9 - 9) ^ 2 + (11 - 11) ^ 2 = 0, and SST = (3 - 7) ^ 2 + (5 - 7) ^ 2 + (7 - 7) ^ 2 + (9 - 7) ^ 2 + (11 - 7) ^ 2 = 40. Therefore R-squared = 1 - (0 / 40) = 1, which means the model Y = 2 * x + 1 is a perfect fit for the given data.
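The steps of this calculation can be sketched in Python as follows, using the data points and the model from the question. The variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

# Predicted values from the model Y = 2 * x + 1.
predicted = [2 * x + 1 for x in xs]
y_mean = sum(ys) / len(ys)

# SSE: squared differences between actual and predicted values.
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
# SST: squared differences between actual values and their mean.
sst = sum((y - y_mean) ** 2 for y in ys)

r_squared = 1 - sse / sst
print(sse, sst, r_squared)
```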

25. For the given numbers: 5, 10, 12, 15, 18, 20, 25, and 30, what is the IQR (Interquartile range)?

  1. 13
  2. 11.5
  3. 15
  4. 20

Answer: B. 11.5

Explanation: In order to calculate the IQR (interquartile range), we first need to find the median of the given numbers. The data in increasing order is 5, 10, 12, 15, 18, 20, 25, and 30. With eight values, the median is the average of the two middle values: (15 + 18) / 2 = 16.5.

The first quartile (Q1):- The lower half of the data set consists of the numbers 5, 10, 12, and 15. The median is (10 + 12) / 2 = 11. Hence Q1 is 11.

The third quartile (Q3):- The upper half of the data set consists of the numbers 18, 20, 25, and 30. The median is (20 + 25) / 2 = 22.5. Hence the Interquartile range (IQR) is calculated as the difference of Q3 and Q1. IQR = Q3 – Q1 = 22.5 – 11 = 11.5.
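The median-of-halves procedure used here can be sketched in Python as below. The function name is hypothetical, and the sketch assumes at least two data values. Other quartile conventions (including the defaults of some libraries) can give slightly different results for the same data.

```python
import statistics

def iqr_median_of_halves(values):
    """Return (Q1, Q3, IQR) using the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above it (middle value excluded if count is odd)
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1

q1, q3, iqr = iqr_median_of_halves([5, 10, 12, 15, 18, 20, 25, 30])
print(q1, q3, iqr)
```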

26. There are two given datasets, X = [1, 2, 3, 4, 5] and Y = [2, 4, 6, 8, 10]. What is the correlation coefficient between these two datasets?

  1. 0
  2. 1
  3. -1
  4. 0.5

Answer: B. 1

Explanation: The correlation coefficient measures the strength of the linear relationship between two variables. Its value ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive correlation. To find the correlation between X and Y, first find the mean and standard deviation (S.D.) of both sets of data. The mean of X is 3, and the mean of Y is 6. The standard deviation of X is approximately 1.58, and the standard deviation of Y is approximately 3.16. The correlation coefficient is then r = Σ[(Xi - Xmean) * (Yi - Ymean)] / [(n - 1) * Sx * Sy]. Plugging in the values, r = [(1-3)(2-6) + (2-3)(4-6) + (3-3)(6-6) + (4-3)(8-6) + (5-3)(10-6)] / (4 * 1.58 * 3.16) = 20 / 19.97 ≈ 1. Hence X and Y are perfectly positively correlated.
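As a cross-check, the Pearson correlation for these two datasets can be computed with a short Python function. The name pearson_r is an assumption for this sketch. The function works with sums of squared deviations directly, which is algebraically the same as the formula with (n - 1) * Sx * Sy because the n - 1 factors cancel.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

r = pearson_r(X, Y)
print(r)
```

Because Y is exactly 2 * X, every point lies on a straight line with positive slope, which is why the coefficient comes out as 1 without any rounding.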