YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark

Background:

The frequent itemset mining (FIM) is one of the most important techniques to extract knowledge from data in many real-world applications. The Apriori algorithm is the widely-used algorithm for mining frequent itemsets from a transactional dataset. However, the FIM process is both data-intensive and computing-intensive. On one side, large scale data sets are usually adopted in data mining nowadays; on the other side, in order to generate valid information, the algorithm needs to scan the datasets iteratively for many times. These make the FIM algorithm very time-consuming over big data. The existing parallel Apriori algorithms implemented with the MapReduce model are not efficient enough for iterative computation.

Significance:
YAFIM (Yet Another Frequent Itemset Mining), a parallel Apriori algorithm based on the Spark RDD framework—a specially-designed in-memory parallel computing model to support iterative algorithms and interactive data mining.

Our proposed algorithm takes the advantage of the Spark RDD model.
We adopt an advanced feature called broadcast variables abstraction in the Spark, which allows programmers to send a piece of shared data to each slave only once.

Experimental results show that, compared with the algorithms implemented with MapReduce, YAFIM achieved ×18 speedup in average for various benchmarks.