Association Rule in Data Mining

What is meant by Association Rule?

An association rule can be understood as an if-then statement. Association rules are generally used to find frequent patterns, correlations, and associations in data sets. In layman's terms, association rule analysis is a technique for figuring out how different items in a data set are associated with one another.

Types of Association Rules Learning

Association rule learning is commonly carried out with one of three main algorithms:

  • Apriori Algorithm
  • Eclat Algorithm
  • F-P Growth Algorithm

Apriori Algorithm:

The Apriori Algorithm works level by level, grouping items into candidate itemsets of increasing size: singletons, then pairs, then triplets, and so on. It was proposed by R. Agrawal and R. Srikant in 1994, mainly for the purpose of finding frequent itemsets in a data set. It relies on the Apriori Property, which makes the generation of frequent itemsets more efficient by reducing the search space.

The Apriori Algorithm rests on the following two statements, which express the same property from opposite directions:

  • All subsets of a frequent itemset must be frequent (this is the Apriori Property).
  • If an itemset is infrequent, all its supersets will be infrequent.

The main advantage of the Apriori Algorithm is that it can be applied to large data sets, unlike the Eclat Algorithm (discussed in the next section), which is better suited to small and medium data sets. Its main drawback is speed: it scans the database repeatedly and generates many candidate itemsets, so although it can handle large data sets, it is not an efficient way of doing so.
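
The level-wise search and the pruning based on the Apriori Property can be sketched in Python as follows. This is a minimal illustration rather than an optimized implementation. The function name `apriori`, the parameter `min_support`, and the choice of threshold are assumptions made for this example; the sample data reuses the nine transactions from the Support example in this article, with each letter standing for one item.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain the whole itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori Property): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one pass over the data for the surviving candidates.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical usage: each string is one transaction, each letter one item.
transactions = ["ABCD", "ABC", "AC", "BD", "ACD", "CD", "BCDE", "AEC", "CDE"]
frequent = apriori(transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), round(s, 3))
```

With this threshold the candidate {A, C, D} is discarded in the prune step without being counted, because its subset {A, D} is not frequent.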

Eclat Algorithm:

Eclat is short for Equivalence Class Transformation. It is one of the most popular methods of association rule mining and is often regarded as a more efficient and scalable alternative to the Apriori Algorithm. Like Apriori, it is used to find frequent itemsets in a transaction database. Its advantages are that it is generally faster than the Apriori Algorithm and that it is well suited to small and medium data sets. One of its best features is that it works on the transaction lists it has already generated, intersecting them to count larger itemsets, whereas the Apriori Algorithm rescans the original data set at every level.
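
The idea of counting itemsets by intersecting transaction lists can be sketched as follows. This is a simplified depth-first version written for illustration only. The function name `eclat`, the parameter `min_count` (an absolute count rather than a fraction), and the sample data are assumptions for this example.

```python
def eclat(transactions, min_count):
    """Return {itemset: count} using vertical transaction-id lists."""
    # Vertical layout: item -> set of ids of the transactions containing it.
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs that can extend the prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            # Intersect tid-lists instead of rescanning the transactions.
            narrower = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_count:
                    narrower.append((other, common))
            extend(itemset, narrower)

    start = sorted(
        ((item, tids) for item, tids in tidlists.items() if len(tids) >= min_count),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent

# Hypothetical usage: each string is one transaction, each letter one item.
transactions = ["ABCD", "ABC", "AC", "BD", "ACD", "CD", "BCDE", "AEC", "CDE"]
result = eclat(transactions, min_count=4)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

The original transactions are read only once, to build the transaction-id lists. Every later count comes from a set intersection, which is also why memory use grows with the size of those lists.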

F-P Growth Algorithm:

F-P Growth Algorithm is short for frequent pattern growth algorithm. It represents the database as a tree structure called a frequent pattern tree (FP-tree). The FP Growth Algorithm has several advantages: it is generally faster than the Apriori method or the Eclat method, the database is stored in a compact form in memory, and it is an efficient and scalable way of mining frequent patterns.

However, it also has some disadvantages: the frequent pattern tree is more difficult to build than the candidate itemsets used by the Apriori Algorithm, and building it may be expensive.
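
To make the tree structure concrete, the following sketch builds a frequent pattern tree from a small transaction list. It covers only the construction step, not the mining of patterns from the tree. The names `FPNode`, `build_fp_tree`, `count_nodes`, and `min_count` are assumptions made for this example.

```python
from collections import Counter

class FPNode:
    """One node of the tree: an item, a count, and child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First scan: count each item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Second scan: insert every transaction as a path from the root,
    # with items ordered by descending frequency so prefixes are shared.
    for t in transactions:
        items = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-frequent[item], item),
        )
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root

def count_nodes(node):
    """Number of nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.children.values())

# Hypothetical usage: each string is one transaction, each letter one item.
transactions = ["ABCD", "ABC", "AC", "BD", "ACD", "CD", "BCDE", "AEC", "CDE"]
root = build_fp_tree(transactions, min_count=4)
print(root.children["C"].count)   # transactions whose path starts with C
print(count_nodes(root))          # nodes needed to store all transactions
```

Transactions that share a prefix share nodes, which is what keeps the tree compact. The two passes over the data and the bookkeeping per node are the cost paid for that compactness.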

Applications of Association Rule:

  • Catalog Design
  • Cross Marketing
  • Market Basket Data Analysis
  • Medical Diagnosis
  • Protein Sequence
  • Data Analysis
  • Classification
  • Clustering
  • Loss Leader Analysis

Methods to measure Association:

Three simple measures are mainly used to evaluate an association rule: Support, Confidence, and Lift.

Support:

Consider a transaction list, as shown below:

Transaction      Items
Transaction 1    A, B, C, D
Transaction 2    A, B, C
Transaction 3    A, C
Transaction 4    B, D
Transaction 5    A, C, D
Transaction 6    C, D
Transaction 7    B, C, D, E
Transaction 8    A, E, C
Transaction 9    C, D, E

Support is the fraction of all transactions that contain a given itemset. In this example, item 'A' appears in 5 of the 9 transactions, so its Support is given as

Support{A} = 5/9

Support reveals how popular an itemset is, that is, how frequently it appears in the data set.
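
The calculation of Support can be checked with a few lines of Python. This is a small sketch: the function name `support` and the use of one string per transaction (one letter per item) are assumptions for this example, and `Fraction` is used only so that the result prints as an exact fraction.

```python
from fractions import Fraction

# The nine transactions from the table; each letter is one item.
transactions = ["ABCD", "ABC", "AC", "BD", "ACD", "CD", "BCDE", "AEC", "CDE"]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return Fraction(hits, len(transactions))

print(support("A", transactions))   # 5/9
print(support("AB", transactions))  # 2/9
```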

Confidence:

Confidence measures how often item 'B' appears in the transactions that contain item 'A'.

To put it in simple words, confidence tells us how likely item 'B' is to be purchased when item 'A' is purchased.

Taking the transaction table from the Support example above, items 'A' and 'B' appear together in 2 of the 9 transactions, so

Confidence {A ---> B} = Support {A, B} / Support {A} = (2/9) / (5/9) = 2/5

One limitation of the confidence measure is that it accounts only for the popularity of item 'A' and ignores the popularity of item 'B'. If item 'B' is popular in general, the confidence of A ---> B will be high even when the two items are not really associated, so the strength of the association can be misrepresented.
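
The confidence formula, and the limitation just described, can be illustrated with a short sketch. The helper names `support` and `confidence` and the one-letter-per-item encoding are assumptions for this example. The sketch also assumes that the first item of the rule appears in at least one transaction, since the division would otherwise fail.

```python
from fractions import Fraction

# The nine transactions from the Support example; each letter is one item.
transactions = ["ABCD", "ABC", "AC", "BD", "ACD", "CD", "BCDE", "AEC", "CDE"]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return Fraction(hits, len(transactions))

def confidence(a, b, transactions):
    """Confidence of the rule a ---> b: Support{a, b} / Support{a}."""
    both = set(a) | set(b)
    return support(both, transactions) / support(a, transactions)

print(confidence("A", "B", transactions))  # 2/5
print(confidence("A", "C", transactions))  # 1
print(support("C", transactions))          # 8/9
```

The rule A ---> C reaches a confidence of 1, but item 'C' already appears in 8 of the 9 transactions, so the high value says little about 'A' in particular.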

Lift:

Lift indicates how likely item 'B' is to be purchased when item 'A' is purchased, while controlling for how popular item 'B' is overall.

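For example, in the transaction table from the Support example, item 'B' appears in 4 of the 9 transactions and items 'A' and 'B' appear together in 2 of them, so

Lift {A ---> B} = Support {A, B} / (Support {A} * Support {B}) = (2/9) / ((5/9) * (4/9)) = 9/10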

If the lift is smaller than 1, then item 'B' is less likely to be bought when item 'A' is purchased.

Similarly, if the lift is greater than 1, then item 'B' is more likely to be bought when item 'A' is purchased.

However, there is also a third case in which the lift is exactly 1. In that case, there is no association between the two items.
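
The three cases can be checked against the transaction table with a short sketch. The helper names `support` and `lift` and the one-letter-per-item encoding are assumptions for this example, and both items of a rule are assumed to appear in at least one transaction.

```python
from fractions import Fraction

# The nine transactions from the Support example; each letter is one item.
transactions = ["ABCD", "ABC", "AC", "BD", "ACD", "CD", "BCDE", "AEC", "CDE"]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return Fraction(hits, len(transactions))

def lift(a, b, transactions):
    """Lift of a ---> b: Support{a, b} / (Support{a} * Support{b})."""
    both = set(a) | set(b)
    return support(both, transactions) / (
        support(a, transactions) * support(b, transactions)
    )

print(lift("A", "B", transactions))  # 9/10
print(lift("A", "C", transactions))  # 9/8
```

Here the lift of A ---> B is slightly below 1, while the lift of A ---> C is slightly above 1.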

Uses of Association Rules in Data Mining:

As has already been observed, association rules play a very big role in data mining. They are crucial in customer analytics, catalog design, cross-marketing, market basket data analysis, product clustering, and many more areas.

Association rule learning is also widely used by programmers who build machine learning applications.