YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark

YAFIM
Background:

The frequent itemset mining (FIM) is one of the most important techniques to extract knowledge from data in many real-world applications. The Apriori algorithm is the widely-used algorithm for mining frequent itemsets from a transactional dataset. However, the FIM process is both data-intensive and computing-intensive. On one side, large scale data sets are usually adopted in data mining nowadays; on the other side, in order to generate valid information, the algorithm needs to scan the datasets iteratively for many times. These make the FIM algorithm very time-consuming over big data. The existing parallel Apriori algorithms implemented with the MapReduce model are not efficient enough for iterative computation.

Significance:
YAFIM (Yet Another Frequent Itemset Mining), a parallel Apriori algorithm based on the Spark RDD framework—a specially-designed in-memory parallel computing model to support iterative algorithms and interactive data mining.
  • Our proposed algorithm takes the advantage of the Spark RDD model.
  • We adopt an advanced feature called broadcast variables abstraction in the Spark, which allows programmers to send a piece of shared data to each slave only once.

Experimental results show that, compared with the algorithms implemented with MapReduce, YAFIM achieved ×18 speedup in average for various benchmarks.