ResearchBib Share Your Research, Maximize Your Social Impacts
Sign for Notice Everyday Sign up >> Login

Implementation and performance analysis of dynamic partitioning of graphs in Apache Spark

Journal: International Journal of Advanced Computer Research (IJACR) (Vol.10, No. 48)

Publication Date:

Authors : ; ;

Page : 116-127

Keywords : Partition; Graphs; RDD; Clustering; Spark.;

Source : Downloadexternal Find it from : Google Scholarexternal

Abstract

In recent years data is growing continuously at an exponential rate. Even in its processed form, large data is difficult to understand and analyze. One of the best approaches to handle large data is to represent it in the graphic format. These graphs can be used for data analysis, processing and decision making. Because of the volume of data is huge, the graphs generated by them are also huge making it difficult to process. Thus, the modular approach of studying them by partitioning them is much more effective. There are several inbuilt partitioning techniques available in Apache Spark which can be extended to graph partitioning according to the user needs. Graph partitioning is an active area of research with considerable activity with an aim towards increasing the accuracy and speed of the algorithms. This research work aims to build a high- performance graph partitioning technique for Apache Spark which makes it faster, scalable and efficient. In this paper, a custom algorithm that can divide the graph into an optimal number of partitions dynamically in Apache Spark is proposed. A novel method to calculate the distance between the nodes and similarity indexes to partition the data is introduced. The optimal number of partitions is decided to use similarity indexes, which are calculated using the concept of Laplacian matrix and Eigenvalues. The proposed algorithm is implemented and its performance is compared to the existing algorithms in Apache Spark. The results indicate that the algorithm partitions graphs which have a huge number of vertices with considerable efficiency and computing cost.

Last modified: 2020-06-18 18:31:25