Adv Topics: Graph dataset partitioning in Cloud Computing

When dealing with cloud environments, the network management systems are key to processing big data, because some computational components within the network can have bandwidth unevenness (Conolly & Begg, 2014; Sakr, 2014). Chen et al. (2016) stated that often the cloud infrastructure is designed as interconnected computational resources in a tree structure, where they are first group in pods, which are then connected to other pods This unevenness can be defined as differing network performance between computational components within and across different racks from each other (Lublinsky, Smith, & Yakubovich, 2013). This unevenness grows as the cloud gets upgrade and the computational components grow to become increasingly heterogeneous with time (Chen et al., 2016). Thus, network management systems, which is part of resource pooling in cloud environments, helps with data load balancing, route connections, and diagnose problems between the “computational node ó link ó computational node ó link ó computational node …;” where each node is connected to multiple databases through multiple links (Conolly & Begg, 2014).

Network bandwidth unevenness becomes an issue when trying to process graph and network data, because (a) the complexity of the data cannot be stored in typical relational database systems; (b) the data is complex to process; and (c) scalability issues (Chen et al., 2016). Sakr (2014) stated there are two ways to scale data: vertical scaling where researchers are finding the few computational components that can handle huge data loads or horizontal scaling where researchers are spreading the data across multiple computational components. Chen et al. (2016) and Sakr (2014) recommended these two large-scale graph dataset partitioning methods for cloud computing systems:

  • Machine Graph: Models the graphical data by treating each computational node as a graphical vertex, such that the interconnectivity between each computational node would represent the connection of that graphical data element to others graphical data elements. A benefit of using this design is that the computational nodes representing this graphical data doesn’t need the knowledge of the topology of the network and can be built by utilizing the network bandwidth between computational nodes.
  • Partition Sketch: A graphical data set is partitioned out to represent multiple tree-structured data, and each computational node address a single level of the graphical tree data. From the perspective of the partitioned tree; the root node holds the input graph, non-leaf nodes hold partition iteration data, and leaf nodes hold the partitioned graph data points. This is built iteratively, with the goal of reducing the number of cross leaf-nodes between partitions. Design wise, each partition sketch is locally optimized, addresses monotonicity (cross leaf-nodes between partitions and non-leaf node depth), and addresses the proximity of parent nodes by defining common parent data between leaf nodes and non-leaf nodes. For instance, data that has a low common parent data should be stored together in high bandwidth computational nodes, whereas leaf-node data should be stored in low bandwidth computational nodes. However, the limitation of such a design is that the partition size must be carefully selected to fit monotonicity, but at the same time not have too many nodes as to run into input and output errors when trying to retrieve the data. Thus, more processing nodes should be used to decrease the number of cross leaf-nodes with low bandwidth, but not too many to cause memory issues.

Resources

  • Chen, R. Weng, X., He, B., Choi, B. and Yang, M. (2016). Network performance aware graph partitioning for large graph processing systems in the cloud. Retrieved from http://www.comp.nus.edu.sg/~hebs/pub/rishannetwork_crc14.pdf
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.