Cluster Analysis in Data Mining

What is meant by cluster analysis?

Cluster analysis in data mining is the process of finding groups of objects such that the objects within a group are similar to one another and different from the objects in other groups.

The first step in the process is to partition the data set into groups based on the similarity of the data.

The advantage of clustering over classification is that it adapts easily to changes in the data, which helps in picking out the features that distinguish the different groups.

Applications of Cluster Analysis

  1. Cluster analysis is very popular in applications such as data analysis, image processing, pattern recognition, market research, and others.
  2. Marketers can identify distinct groups of customers and categorize those groups according to their purchasing patterns.
  3. In biology, it is very helpful for characterizing genes according to their similarities, and thus it aids in deriving plant and animal taxonomies. It can also be used to gain insight into the structures of various organisms.
  4. One of clustering's most valuable roles is outlier detection; for example, it can be used to detect credit-card fraud.
  5. In data mining, cluster analysis lets experts gain insight into the distribution of the data and observe the characteristics of each cluster.
  6. It is also very helpful for identifying areas of similar land use in an earth-observation database.

Advantages of Cluster Analysis:

  1. It is an inexpensive option, as it cuts down the cost of preparing a sampling frame and other administrative overhead.
  2. There is no need for special scales of measurement.
  3. Visual graphics make the different clusters easy to understand and compare.

Disadvantages of Cluster Analysis:

  1. The main disadvantage is that the clusters are usually not formed on any theoretical basis; they emerge from the data and the chosen algorithm rather than from theory.
  2. Moreover, in some cases it is very difficult to decide how the clusters should be determined (for example, how many there should be).

Cluster analysis can be divided into two main categories, which are as follows:

  1. Hierarchical Clustering
  2. Non-Hierarchical Clustering

Hierarchical Clustering:

Hierarchical Clustering creates a hierarchy of objects that represents the "tree of similarities."

One of its advantages is that it does not require the number of clusters to be known in advance.

A disadvantage is that the result depends on the scale of the data.

Two approaches are used in this kind of clustering: agglomerative and divisive.

Agglomerative Clustering:

It is a bottom-up algorithm: each object starts as its own cluster, and the closest clusters are merged step by step until a single cluster remains.

In R (library cluster):

    agnes(x, diss, metric, stand, method, ...)

Where the terms refer to:

x - data frame of real feature vectors, or a matrix of dissimilarities.

diss - is x a dissimilarity matrix? (TRUE or FALSE).

metric - the metric used ("euclidean" or "manhattan").

stand - should the data be standardized? (TRUE or FALSE).

method - the method used to measure the distance between clusters.

Agglomerative clustering can be further categorized into different variants (a short usage sketch follows the list), such as:

  1. Single linkage
  2. Complete linkage
  3. Average linkage
  4. Ward's distance
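
As a brief illustration, here is a minimal agnes() sketch; R's built-in USArrests data set and the settings below are illustrative choices, not part of the original example.

    # Minimal agnes() sketch (assumes the cluster package is installed);
    # USArrests and the settings below are illustrative choices.
    library(cluster)

    d  <- scale(USArrests)                       # standardize the variables first
    ag <- agnes(d, diss = FALSE, metric = "euclidean",
                stand = FALSE, method = "ward")  # Ward's distance
    plot(ag, which.plots = 2)                    # dendrogram: the "tree of similarities"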

Divisive Clustering:

This algorithm starts with a single cluster that consists of all the objects. Divisive clustering is a top-down approach: initially, all the points in the dataset belong to one cluster, and splits are performed repeatedly as one moves down the hierarchy.

In R (library cluster):

    diana(x, diss, metric, stand, ...)

Where the terms refer to:

x - data frame of real feature vectors, or a matrix of dissimilarities.

diss - is x a dissimilarity matrix? (TRUE or FALSE).

metric - the metric used ("euclidean" or "manhattan").

stand - should the data be standardized? (TRUE or FALSE).
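
A comparable sketch with diana(), again on the illustrative USArrests data:

    # Minimal diana() sketch (cluster package); settings are illustrative.
    library(cluster)

    d  <- scale(USArrests)
    dv <- diana(d, metric = "euclidean", stand = FALSE)
    plot(dv, which.plots = 2)   # dendrogram grown top-down by repeated splits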

Non-Hierarchical Clustering:

K-means Clustering:

K-means is an easy-to-understand, easy-to-implement approach to non-hierarchical clustering. It is also fast, converging in a finite number of steps. However, it has some disadvantages: different initial clusterings can lead to different final clusterings, and the result depends on the units of measurement. So, if the variables are of a different nature or differ greatly in magnitude, they must be standardized.

K-means is available through the stats library.

In R (library stats):

    kmeans(x, centers, iter.max, nstart, algorithm)

Where the different terms refer to:

x - data frame of real feature vectors.

centers - the number of cluster centers.

iter.max - the maximum number of iterations.

nstart - the number of random restarts.

algorithm - the method used (e.g., "Lloyd", "MacQueen", "Hartigan-Wong").
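
For example, a minimal run on the illustrative USArrests data might look like this (k = 3 and nstart = 25 are arbitrary choices for the sketch):

    # Minimal kmeans() sketch; stats is loaded by default in R.
    set.seed(1)                  # the result depends on the random initial centers
    d  <- scale(USArrests)       # standardize: the result depends on units
    km <- kmeans(d, centers = 3, iter.max = 20, nstart = 25,
                 algorithm = "Lloyd")
    table(km$cluster)            # sizes of the resulting clusters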

K-medoids Clustering:

The literal meaning of medoid is "the most central object"; in clustering, it refers to the most central object of each cluster. Like K-means, it is an easy-to-understand, easy-to-implement approach, and it is fast, converging in a finite number of steps. It is also less sensitive to outliers than K-means, which is one of its advantages. However, K-medoids has some disadvantages as well: different initial sets of medoids can lead to different final clusterings, and the resulting clustering depends on the units of measurement, so if the variables are of a different nature, they should be standardized.

In R (library cluster):

    pam(x, k, diss, metric, medoids, stand, ...)

Where the terms stand for:

x - data frame of real feature vectors, or a matrix of dissimilarities.

k - the number of clusters.

diss - is x a dissimilarity matrix? (TRUE or FALSE).

metric - the metric used ("euclidean" or "manhattan").

medoids - vector of initial medoids.

stand - should the data be standardized? (TRUE or FALSE).
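
As a sketch, using the same illustrative USArrests data with k = 3:

    # Minimal pam() sketch (cluster package); k = 3 is an illustrative choice.
    library(cluster)

    d  <- scale(USArrests)
    pm <- pam(d, k = 3, metric = "euclidean", stand = FALSE)
    pm$medoids                   # the most central object of each cluster
    table(pm$clustering)         # sizes of the resulting clusters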

Model-based Clustering:

The basic idea behind this approach is that it finds the "most probable" assignment of objects to clusters, together with the positions of the cluster centers and the covariance matrices that describe the shapes of the clusters. Its advantage over K-means and K-medoids is that model-based clustering can find elliptic clusters, whereas K-means and K-medoids form only spherical clusters. It also does not depend on the scale of the variables. However, there are disadvantages as well: first and foremost, it is harder to understand, and, unlike K-medoids, it cannot use dissimilarities.

In R (library mclust):

    Mclust(data, modelNames, ...)

Where the terms refer to:

data - data frame of real feature vectors.

modelNames - the models to be used (e.g., "EII", "VII", "EVI", "VVI", "EEE", "EEV", and so on).
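
A minimal sketch, assuming the mclust package is installed; leaving modelNames unset lets the function compare several covariance models by BIC, and the range of cluster counts below is an illustrative choice:

    # Minimal Mclust() sketch on the illustrative USArrests data.
    library(mclust)

    mc <- Mclust(USArrests, G = 1:5)   # try 1 to 5 clusters
    summary(mc)                        # reports the chosen model (EII, VVI, ...) and G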