What is clustering?
Clustering is an unsupervised learning method that works on unlabeled data. It groups the data into clusters so that objects with similar properties fall into the same cluster and differ from the objects in other clusters. Many clustering methods exist, such as DBSCAN, OPTICS, and CURE.
What is OPTICS Clustering?
OPTICS stands for Ordering Points To Identify the Clustering Structure. It is a density-based, unsupervised clustering algorithm developed after DBSCAN (Density-Based Spatial Clustering of Applications with Noise). OPTICS extends the DBSCAN concepts with two additional terms: core distance and reachability distance.
OPTICS is similar to DBSCAN but can extract clusters of varying densities and shapes. It helps locate clusters with various densities in big, high-dimensional datasets.
As the name suggests, the main objective of OPTICS is to extract the clustering structure of the dataset by locating densely connected points. The algorithm produces an ordered list of points, which is drawn as the reachability plot and serves as a density-based representation of the data. Each entry in the list carries a reachability distance, a measure of how easily the point can be reached from the points processed before it. Points that are adjacent in the ordering and have similar, low reachability distances are grouped into the same cluster.
Approach to the OPTICS Clustering Algorithm
- Step 1: Choose the parameters: Eps, the maximum neighbourhood radius to consider, and MinPts, the minimum number of points that must lie within that radius for a region to count as dense.
- Step 2: Calculate the distance from each point in the dataset to its k nearest neighbours, where k is MinPts.
- Step 3: Calculate the reachability distances. Starting at any point, repeatedly visit the unprocessed point with the smallest reachability distance and update the reachability distance of its neighbours based on the local density.
- Step 4: Create the reachability plot by plotting each point's reachability distance in the order in which the points were processed.
- Step 5: Using the reachability plot, identify clusters by combining points that are next to one another in the ordering and have a similar range of reachability distances.
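The steps above can be sketched in a few lines of Python. This is a minimal, brute-force illustration rather than a reference implementation: the function name `optics_order` is hypothetical, distances are assumed to be Euclidean, a point is assumed to count as its own neighbour when MinPts is applied, and all pairwise distances are held in memory, which only suits small datasets.

```python
import heapq
import math

import numpy as np


def optics_order(X, min_pts, eps=math.inf):
    """Return (ordering, reachability, core_dist) for the rows of X."""
    n = len(X)
    # Pairwise Euclidean distances (O(n^2) memory, small inputs only).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Core distance: distance to the min_pts-th nearest point, counting the
    # point itself (an assumption). It is undefined (inf) beyond eps.
    core_dist = np.sort(dist, axis=1)[:, min_pts - 1]
    core_dist[core_dist > eps] = math.inf

    reachability = np.full(n, math.inf)
    processed = np.zeros(n, dtype=bool)
    ordering = []

    for start in range(n):
        if processed[start]:
            continue
        # Priority queue of (reachability, index); stale entries are skipped.
        seeds = [(math.inf, start)]
        while seeds:
            _, p = heapq.heappop(seeds)
            if processed[p]:
                continue
            processed[p] = True
            ordering.append(p)
            if math.isinf(core_dist[p]):
                continue  # p is not a core point, so it expands nothing
            # Update every unprocessed neighbour that p reaches more cheaply.
            for q in np.flatnonzero((dist[p] <= eps) & ~processed):
                new_reach = max(core_dist[p], dist[p, q])
                if new_reach < reachability[q]:
                    reachability[q] = new_reach
                    heapq.heappush(seeds, (new_reach, int(q)))
    return np.array(ordering), reachability, core_dist


# Two tight groups on a line, far apart from each other.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
ordering, reachability, core_dist = optics_order(X, min_pts=2)
print(ordering)      # processing order of the points
print(reachability)  # reachability distance of each point, by original index
```

In this toy dataset the jump in reachability at the first point of the second group is what shows up as a peak between two valleys in the reachability plot.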
OPTICS is available in Python through the scikit-learn (sklearn) package, which provides the class sklearn.cluster.OPTICS. Its main parameters are min_samples (the number of neighbours a point needs to be a core point, i.e. MinPts), max_eps (the maximum neighbourhood radius, i.e. Eps), and the cluster extraction settings such as cluster_method and xi. All of them have default values.
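A minimal usage sketch of sklearn.cluster.OPTICS follows. The synthetic dataset, the blob centres, and the parameter values are assumptions chosen only for illustration, not recommended settings.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Three synthetic blobs with different spreads (illustrative data only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=[0.4, 0.8, 1.2],
    random_state=0,
)

# min_samples plays the role of MinPts, max_eps the role of Eps.
model = OPTICS(min_samples=10, max_eps=np.inf, cluster_method="xi", xi=0.05)
model.fit(X)

labels = model.labels_  # cluster label per point, -1 marks noise
# Reachability distances arranged in processing order: the reachability plot.
reach_plot = model.reachability_[model.ordering_]

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("first reachability values:", reach_plot[:5])
```

Plotting `reach_plot` as a bar or line chart gives the reachability plot described in this article. The first value is infinite because the starting point is not reached from any earlier point.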
Working of OPTICS
OPTICS does not assign the data to clusters directly. Instead, it produces an ordering of the points together with their reachability distances, and clusters are read off a visualisation of those distances. It adds two concepts to those of DBSCAN:
- Core Distance: The smallest radius at which a given point qualifies as a core point, that is, the smallest radius whose neighbourhood contains at least MinPts points. The core distance is undefined if the point is not a core point within Eps.
- Reachability Distance: The reachability distance of a point q with respect to a point p is the maximum of the core distance of p and the Euclidean distance between p and q. The point p must be a core point for the reachability distance to be defined.
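The two definitions can be checked on a tiny example. The sketch below uses hypothetical helper names (`core_distance`, `reachability_distance`) and assumes that a point counts as its own neighbour when MinPts is applied. Some texts exclude the point itself, which shifts the result by one neighbour.

```python
import numpy as np


def core_distance(X, p, min_pts, eps=np.inf):
    """Smallest radius around X[p] that contains min_pts points (p included)."""
    d = np.sort(np.linalg.norm(X - X[p], axis=1))
    cd = d[min_pts - 1]
    # Undefined (represented here as inf) if p is not a core point within eps.
    return cd if cd <= eps else np.inf


def reachability_distance(X, p, q, min_pts, eps=np.inf):
    """Reachability distance of X[q] with respect to X[p]."""
    return max(core_distance(X, p, min_pts, eps), np.linalg.norm(X[p] - X[q]))


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 0.0]])

print(core_distance(X, 0, min_pts=3))                  # 2.0
print(reachability_distance(X, 0, 1, min_pts=3))       # 2.0 (core distance wins)
print(reachability_distance(X, 0, 3, min_pts=3))       # 5.0 (actual distance wins)
print(core_distance(X, 0, min_pts=3, eps=1.5))         # inf (not a core point)
```

A nearby point can never have a reachability distance below the core distance of the point it is reached from, which is what smooths the reachability plot inside dense regions.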
Although the MinPts parameter enters both computations, in theory its exact value has little effect on the result, because changing it scales all the distances at roughly the same rate.
First, the core distance is calculated for every data point in the set. The points are then processed one at a time: each processed point is appended to the ordering, and the reachability distances of its unprocessed neighbours are updated. The next point to process is always the unprocessed one with the minimum reachability distance. In this way, points that belong to the same dense region end up next to each other in the ordering. Finally, the actual cluster labels are extracted from the reachability plot by searching for “valleys”, that is, runs of low reachability distance separated by local maxima.
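One simple way to find such valleys is to cut the reachability plot at a fixed threshold. The sketch below is an assumption-laden simplification: the function name `extract_clusters` is hypothetical, the reachability values are made up, and a single global threshold is used rather than the steepness-based extraction that libraries may offer.

```python
import numpy as np


def extract_clusters(reach_plot, threshold):
    """Label each position of a reachability plot; -1 marks noise."""
    reach_plot = np.asarray(reach_plot, dtype=float)
    labels = np.full(len(reach_plot), -1)
    current = -1
    for i, r in enumerate(reach_plot):
        if r > threshold:
            # A peak. It opens a new valley only if the next point in the
            # ordering is reachable below the threshold; otherwise it is noise.
            nxt = reach_plot[i + 1] if i + 1 < len(reach_plot) else np.inf
            if nxt <= threshold:
                current += 1
                labels[i] = current
        else:
            # Inside a valley: same cluster as the previous point.
            labels[i] = current
    return labels


# Made-up reachability values in processing order: two valleys, then noise.
reach_plot = [np.inf, 1.0, 1.0, 8.0, 1.0, 1.0, 9.0, 9.5]
labels = extract_clusters(reach_plot, threshold=3.0)
print(labels)  # [ 0  0  0  1  1  1 -1 -1]
```

Lowering the threshold splits the data into more, denser clusters, and raising it merges them, which is how a single ordering yields clusterings at several granularities.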
Detecting Outliers using OPTICS
OPTICS can also be used for detecting outliers. An outlier is a data point that does not fit the overall pattern of the dataset. An extension of the algorithm, OPTICS-OF, is designed for this purpose; the OF stands for Outlier Factor. It gives every point an outlier score by comparing the point with its closest neighbours instead of with the entire cluster. This “local” principle is what distinguishes it as an outlier detection process.
The Outlier Factor relies on a new measure, the “Local Reachability Density”. It is the inverse of the average reachability distance from the point you are evaluating to its MinPts nearest neighbours.
After the local reachability density has been calculated for each point, the Outlier Factor is calculated. For a specific point, it is the average of the ratios between the local reachability density of each of its MinPts nearest neighbours and the local reachability density of the point itself. Values close to 1 indicate a point about as dense as its neighbours, while values well above 1 indicate an outlier.
The “local” element of the Outlier Factor separates OPTICS-OF from other outlier detection methods. Rather than providing only a binary value, it assigns each point a relative outlier score.
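The two measures can be sketched with NumPy as follows. This is an illustrative brute-force version under stated assumptions: the function name `outlier_factors` is hypothetical, distances are Euclidean, neighbours exclude the point itself, the dataset contains no duplicate points, and ties beyond the MinPts-th neighbour are ignored.

```python
import numpy as np


def outlier_factors(X, min_pts):
    """Return (local reachability density, outlier factor) for each row of X."""
    n = len(X)
    rows = np.arange(n)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Indices of the min_pts nearest other points; column 0 of the argsort is
    # the point itself (distance 0), so it is skipped.
    nbrs = np.argsort(dist, axis=1)[:, 1:min_pts + 1]

    # Distance from each point to its min_pts-th nearest other point.
    k_dist = dist[rows, nbrs[:, -1]]

    # Reachability distance from each point p to each neighbour o:
    # max(k_dist(o), d(p, o)).
    reach = np.maximum(k_dist[nbrs], dist[rows[:, None], nbrs])

    # Local reachability density: inverse of the average reachability distance.
    lrd = 1.0 / reach.mean(axis=1)

    # Outlier factor: average ratio of the neighbours' density to the point's.
    of = lrd[nbrs].mean(axis=1) / lrd
    return lrd, of


# Four points on a unit square plus one far-away point.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [8.0, 8.0]])
lrd, of = outlier_factors(X, min_pts=2)
print(np.round(of, 2))  # the last point scores far above 1
```

The square's corners all score 1 because each is exactly as dense as its neighbours, while the isolated point receives a much larger score.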
Benefits of OPTICS
- It gives the clustering structure of the data set.
- It forms the reachability plot.
- There is no requirement to fix the number of clusters in advance.
- It can identify hierarchical structures and handle clusters of various densities and forms.
A reachability plot generated by OPTICS can be used to extract clusters at various granularities. OPTICS also has drawbacks. It needs more memory than DBSCAN, because it maintains a priority queue to find the next point with the smallest reachability distance, and maintaining that queue also gives it a higher runtime cost. In the reachability plot, small clusters surrounded by noise points may blend into that noise, making it harder for OPTICS to distinguish them.