Adv Topics: Graph dataset partitioning in Cloud Computing

When dealing with cloud environments, network management systems are key to processing big data, because some computational components within the network can have bandwidth unevenness (Connolly & Begg, 2014; Sakr, 2014). Chen et al. (2016) stated that cloud infrastructure is often designed as interconnected computational resources in a tree structure, where resources are first grouped into pods, which are then connected to other pods. This unevenness can be defined as differing network performance between computational components within the same rack and across different racks (Lublinsky, Smith, & Yakubovich, 2013). The unevenness grows as the cloud gets upgraded and the computational components become increasingly heterogeneous over time (Chen et al., 2016). Thus, network management systems, which are part of resource pooling in cloud environments, help with load balancing the data, routing connections, and diagnosing problems along the chain of “computational node ↔ link ↔ computational node ↔ link ↔ computational node …,” where each node is connected to multiple databases through multiple links (Connolly & Begg, 2014).

Network bandwidth unevenness becomes an issue when trying to process graph and network data because (a) the data is too complex to be stored in typical relational database systems; (b) the data is complex to process; and (c) there are scalability issues (Chen et al., 2016). Sakr (2014) stated there are two ways to scale data: vertical scaling, where researchers find the few computational components that can handle huge data loads, or horizontal scaling, where researchers spread the data across multiple computational components. Chen et al. (2016) and Sakr (2014) recommended these two large-scale graph dataset partitioning methods for cloud computing systems:

  • Machine Graph: Models the graphical data by treating each computational node as a graphical vertex, such that the interconnectivity between computational nodes represents the connections between graphical data elements. A benefit of this design is that the computational nodes representing the graphical data don’t need knowledge of the network topology, and the partitioning can be built by utilizing the measured network bandwidth between computational nodes.
  • Partition Sketch: The graphical data set is partitioned out into a tree structure, and each computational node addresses a single level of the graphical tree data. From the perspective of the partitioned tree, the root node holds the input graph, non-leaf nodes hold partition iteration data, and leaf nodes hold the partitioned graph data points. The sketch is built iteratively, with the goal of reducing the number of cross links between leaf-node partitions. Design-wise, each partition sketch is locally optimized, addresses monotonicity (cross links between partitions and non-leaf node depth), and addresses the proximity of parent nodes by defining common parent data between leaf nodes and non-leaf nodes. For instance, data that shares a low common parent should be stored together on high-bandwidth computational nodes, whereas unrelated leaf-node data can be stored on low-bandwidth computational nodes. However, the limitation of such a design is that the partition size must be carefully selected to satisfy monotonicity, but at the same time not involve so many nodes as to run into input and output errors when trying to retrieve the data. Thus, more processing nodes should be used to decrease the number of cross links over low-bandwidth links, but not so many as to cause memory issues. A toy sketch of this bandwidth-aware placement idea follows this list.
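To make the bandwidth-awareness idea concrete, the following is a minimal, hypothetical sketch (not Chen et al.'s actual algorithm): a machine graph records measured bandwidth between computational nodes, and candidate placements of graph partitions are scored by how much cross-partition traffic would flow over low-bandwidth links. All machine names, bandwidth figures, and edge counts are made up for illustration.

```python
from itertools import permutations

# Hypothetical machine graph: measured bandwidth (MB/s) between compute nodes.
# In a real cluster these values would come from network probes, not constants.
machine_bandwidth = {
    frozenset(("m1", "m2")): 1000,   # same rack/pod: high bandwidth
    frozenset(("m1", "m3")): 100,    # across racks: lower bandwidth
    frozenset(("m2", "m3")): 100,
}

# Hypothetical graph partitions: number of edges crossing each partition pair.
cross_edges = {
    frozenset(("p1", "p2")): 500,    # heavily connected partitions
    frozenset(("p1", "p3")): 20,
    frozenset(("p2", "p3")): 10,
}

def comm_cost(placement):
    """Estimated communication time: cross edges divided by the bandwidth
    between the two machines hosting the partitions."""
    cost = 0.0
    for pair, edges in cross_edges.items():
        pa, pb = tuple(pair)
        link = frozenset((placement[pa], placement[pb]))
        cost += edges / machine_bandwidth[link]
    return cost

def best_placement(partitions, machines):
    """Brute-force the placement with the lowest communication cost.
    Fine for a toy example; real systems rely on heuristics instead."""
    best = None
    for machs in permutations(machines):
        candidate = dict(zip(partitions, machs))
        if best is None or comm_cost(candidate) < comm_cost(best):
            best = candidate
    return best

placement = best_placement(["p1", "p2", "p3"], ["m1", "m2", "m3"])
print(placement, comm_cost(placement))
```

The toy objective mirrors the goal described above: partitions that exchange many edges end up on nodes connected by high-bandwidth links, while lightly connected partitions can tolerate the low-bandwidth, cross-rack links.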

Resources

  • Chen, R., Weng, X., He, B., Choi, B., & Yang, M. (2016). Network performance aware graph partitioning for large graph processing systems in the cloud. Retrieved from http://www.comp.nus.edu.sg/~hebs/pub/rishannetwork_crc14.pdf
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.

Adv Topics: MapReduce and Hadoop

Hadoop allows for data processing through MapReduce, and it also allows for data storage (Lublinsky et al., 2013). MapReduce is an analytical engine and pattern that takes advantage of distributed systems while keeping the processing and the data on the same machine (Sadalage & Fowler, 2012). MapReduce thus contains two functions that work in parallel on distributed systems (Hortonworks, 2013; Sadalage & Fowler, 2012; Sakr, 2014; Sathupadi, 2010):

    1. Mapper functions create and process transactions on the system by mapping and aggregating data by key values. A mapper reads only one data record at a time.
    2. Reducer functions know what the key values are and take all the values stored in a map to reduce the data to what is relevant. Reducers help summarize the data into a single output, which limits the amount of data moving between multiple computational nodes.

Lublinsky, Smith, and Yakubovich (2013) stated that an intermediate component of MapReduce is known as shuffle and sort, where the outputs of the mapping functions are moved and presented to the reducer function.
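As a concrete illustration of the map → shuffle/sort → reduce flow, here is a minimal in-memory sketch that totals sales quantities per product. The function names, record format, and data are illustrative only; a real Hadoop job would run many mapper and reducer instances in parallel across nodes rather than in one process.

```python
from collections import defaultdict

# Toy input: one sales record per line ("product quantity") -- illustrative data.
records = ["apples 3", "pears 2", "apples 5", "pears 1", "plums 4"]

def mapper(record):
    """Map phase: read one record at a time and emit a (key, value) pair."""
    product, quantity = record.split()
    yield (product, int(quantity))

def shuffle_sort(pairs):
    """Shuffle and sort phase: group every value under its key so each
    reducer sees all the values for the keys it is responsible for."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Reduce phase: summarize all the values for one key into one output."""
    return (key, sum(values))

mapped = [pair for record in records for pair in mapper(record)]
output = [reducer(key, values) for key, values in shuffle_sort(mapped)]
print(output)   # [('apples', 8), ('pears', 3), ('plums', 4)]
```

The shuffle and sort step in the middle is exactly the intermediate component described above: it is what moves each key’s values from the mappers to the reducer responsible for that key.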

Thus, MapReduce is a framework that uses parallel sequential algorithms that capitalize on cloud architecture, and it became popular as the main executable analytic engine of the open source Hadoop project (Lublinsky et al., 2013; Sadalage & Fowler, 2012; Sakr, 2014). Essentially, a sequential algorithm is a computer program that runs a sequence of commands, and a parallel algorithm runs a set of sequential commands over separate computational cores (Brookshear & Brylow, 2014; Sakr, 2014). Thus, a parallel sequential algorithm runs a full sequential program over multiple but separate cores (Sakr, 2014). Another feature of MapReduce is that a reducer’s output can become another map function’s input (Sadalage & Fowler, 2012). The advantages and disadvantages of using MapReduce are (Lublinsky et al., 2013; Sakr, 2014):

+ aggregation techniques under the mapper function can exploit multiple different techniques

+ no read or write of intermediate data, thus preserving the input data

+ no need to serialize or de-serialize code in either memory or processing

+ it is scalable based on the size of data and resources needed for processing the data

+ isolation of the sequential program from data distribution, scheduling, and fault tolerance

– too many mapper functions can create an infrastructure overhead, which increases resources and thus cost

– too few mapper functions can create huge workloads for certain types of computational nodes

– too many reducers can provide too many outputs, and too few reducers can provide too few outputs

– it’s a different programming paradigm that most programmers are not familiar with

– the available parallelism is underutilized for smaller data sets

Hadoop is predominantly known for popularizing MapReduce tasks, and it is also known for its Hadoop Distributed File System (HDFS), where the data is distributed across multiple systems (Rathbone, 2013). Hadoop’s service is part of the cloud (as Platform as a Service = PaaS). For PaaS, the end users manage the applications and data, whereas the provider (Hadoop) administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001). Data is broken up into small blocks, like Legos, such that the blocks are distributed across a distributed database system and across multiple servers and can be processed across all of these servers, e.g., a Hadoop cluster (IBM, n.d.).

A common example of a parallel sequential program is a dynamical weather forecasting model. In dynamical weather forecasting models, a set of defined geodynamic, thermodynamic, and physical sequential algorithms defines and evolves the seven main weather variables across time. For each time step, the forecasting model runs these sequential algorithms over each grid point, which can represent a finite geospatial region. Each of these geospatial regions is split amongst multiple computational cores. The example grows in complexity when data has to travel between different finite geospatial regions through their boundaries, which is an example of data parallelism (Sakr, 2014). MapReduce uses the concept of data parallelism to help map and reduce data. Therefore, weather models could be considered a loose form of a MapReduce algorithm.
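To make the data-parallelism analogy concrete, here is a toy sketch (not an actual forecasting model) that splits a one-dimensional temperature field into chunks, runs the same sequential smoothing step on each chunk, and exchanges boundary ("halo") values between neighboring chunks. In a real model each chunk would run on its own core; the field, step count, and smoothing rule are made up for illustration.

```python
def smooth_step(chunk, left_halo, right_halo):
    """One sequential time step: average each point with its neighbors.
    The halos stand in for data owned by the neighboring regions/cores."""
    padded = [left_halo] + chunk + [right_halo]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]

# Toy temperature field split into three geospatial chunks.
chunks = [[10.0, 10.0, 10.0], [10.0, 30.0, 10.0], [10.0, 10.0, 10.0]]

for _ in range(3):  # three time steps
    new_chunks = []
    for i, chunk in enumerate(chunks):
        # Boundary exchange: borrow the edge value of each neighboring chunk.
        left = chunks[i - 1][-1] if i > 0 else chunk[0]
        right = chunks[i + 1][0] if i < len(chunks) - 1 else chunk[-1]
        new_chunks.append(smooth_step(chunk, left, right))
    chunks = new_chunks

print(chunks)
```

The same sequential step runs on every chunk, and only the boundary values move between regions, which is the data-parallel pattern the weather example describes.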

Resources:

Health care as a Service (HaaS) – cloud solution

Health cloud-healthcare as a service (HaaS) Case Study (John & Shenoy, 2014):

The goal of this study is to provide the framework to build Health Cloud, a healthcare system that helps solve some of the issues currently faced in the healthcare data analytics field, especially when paper images and data are limited to a single healthcare provider’s facility until they are faxed, scanned, or mailed. The Health Cloud is able to store and index medical data, process images, generate reports, chart and analyze trends, and secure the data with identification and access control.  The image processing capabilities of Health Cloud enable better diagnosis of a patient’s medical condition.  The image processing structure was built using C++ code, while data requests and reports are handled in Binary JSON (BSON) or text formats. Finally, the system allows images to be framed, visualized, panned, zoomed, and annotated.

Issues related to health care data on the cloud (John & Shenoy, 2014):

  1. The amount of MRI data has doubled in a decade, and CT data has increased by 50%, increasing the number of images primary providers request on their patients to improve and inform patient care. Thus, there is a need for hyper-scale cloud features.
  2. The Health Insurance Portability and Accountability Act (HIPAA) requires data to be stored for six years after a patient has been discharged, therefore increasing the volume of data. Consequently, there is another need for hyper-scale cloud features.
  3. Healthcare providers should be able to share medical data from anywhere and at any time per the Health Information Technology for Economic and Clinical Health Act (HITECH) and the American Recovery and Reinvestment Act (ARRA), which aim to reduce duplication of data and improve data quality and access. HIPAA has created security regulations on data backup, recovery, and access. Hence, there is a need for a community cloud provider familiar with HIPAA and other regulations.
  4. Each hospital system is developed in silos or purchased from different suppliers. Thus, if data is shared, it may not be in a format that is easily received by the other systems, so a common architecture and data model must be developed.  This can be resolved under a community cloud.
  5. Seamless access to the data stored in the cloud must be created across various mobile platforms. Thus, a cloud-provided option such as Software as a Service may best fit this requirement.
  6. Healthcare workflows are better managed in cloud-based solutions than in paper-based ones.
  7. Cloud capabilities can be used for processing data, depending on what is purchased from which supplier.

Pros and Cons of healthcare data on the public or private cloud:

On-site private clouds can have limited storage space, and the data may not be in a format that is easily transferable to other on-site private clouds (Bhokare et al., 2016). Upgrade, maintenance, and infrastructure costs fall 100% on the healthcare provider.  Although these clouds are expensive, they offer the most control over the data and more control over the specialization of reports.

Public clouds distribute the cost of upgrades, maintenance, and infrastructure across all others requesting the servers (Connolly & Begg, 2014). However, the servers may not be specialized 100% to all regulatory and legal specifications, or the servers could have additional regulatory and legal specifications not advantageous to the healthcare cloud system.  Also, data stored on public clouds sits on infrastructure shared with other companies, which can leave healthcare providers feeling vulnerable about their data’s security within the public cloud (Sumana & Biswal, 2016).

The solution should be a private or public community cloud.  A community cloud environment is a cloud that is shared exclusively by a set of companies that share similar characteristics: compliance, security, jurisdiction, etc. (Connolly & Begg, 2014). Thus, the infrastructure of all of these servers and grids meets industry standards and best practices, with the shared cost of the infrastructure maintained by the community. Certain community services would be optimized for HIPAA, HITECH, ARRA, etc., with little overhead to the individual IT teams that make up the overall community (John & Shenoy, 2014).

Reference

  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private Cloud using GlusterFS and Docker. International Journal of Engineering Science, 5016.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). Pearson Learning Solutions. [Bookshelf Online].
  • John, N., & Shenoy, S. (2014). Health cloud-Healthcare as a service (HaaS). In Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference on (pp. 1963-1966). IEEE.
  • Sumana, P., & Biswal, B. K. (2016). Secure Privacy Protected Data Sharing Between Groups in Public Cloud. International Journal of Engineering Science, 3285.

Data Tools: Hadoop Basic Components & Architecture

Big Data

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).  What is considered big data can change with respect to time: what was considered big data in 2002 is not considered big data in 2016, due to the advancements made in technology over that time (Fox & Do, 2013).  However, given that big data today is too big to be processed by just one processor, parallel processing allows data analytics to be conducted more efficiently through platforms like Hadoop (Hortonworks, 2013; IBM, n.d.).

Hadoop: Basic Components and Architecture

Hadoop’s service is part of the cloud (as Platform as a Service = PaaS).  For PaaS, the end users manage the applications and data, whereas the provider (Hadoop) administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), where the data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). Data is broken up into small blocks, like Legos, such that they are distributed across a distributed database system and across multiple servers (IBM, n.d.).  Just like Legos, the end results can be assembled back together.  This feature of HDFS allows Hadoop to manage big data through parallel processing and analysis (Gray et al., 2005; Hortonworks, 2013; IBM, n.d.).  Multiple data types are supported through HDFS (IBM, n.d.). Hadoop’s MapReduce function can be broken down into two queries.

Parallel processing is key for Hadoop because it makes quick work of a big data set: rather than having one processor do all the work, Hadoop splits up the task amongst many processors. The first of MapReduce’s two main queries splits the data into the Lego pieces and places them across a group of computer nodes in the HDFS, which is called the mapping procedure (Eini, 2010; IBM, n.d.; Hortonworks, 2013; Sathupadi, 2010). The second MapReduce query applies algorithms to reduce the data in each of the computer nodes equally to answer the question that was asked of the data, such that at the end of the parallel processing procedures, the reduced data gets combined and further reduced to provide the final answer (Eini, 2010; IBM, n.d.; Hortonworks, 2013; Minelli et al., 2013; Sathupadi, 2010). In other words, data is partitioned, sorted, and grouped to provide a key and value as an output (Hortonworks, 2013; Rathbone, 2013; Sathupadi, 2010). Therefore, per IBM (n.d.), MapReduce functions use the HDFS to house the data, and MapReduce runs its procedures on the server in which the data is stored.  Data is stored in memory, not in cache, which allows for continuous service (Gu & Li, 2013; Zaharia et al., 2012).

Given the Lego-blocks feature in the HDFS, which allows for MapReduce functions, these blocks contain subsets of the data that are small enough to be easily duplicated (for disaster recovery purposes) on two or more different servers (IBM, n.d.).  This partitioning of the data into data Lego blocks allows big iterative tasks to be done quite easily and efficiently for big data sets (Gu & Li, 2013).
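The "Lego block" idea can be sketched in a few lines: split a file into fixed-size blocks and assign each block to more than one server. The block size, server names, and round-robin placement below are illustrative simplifications, not how HDFS is actually implemented.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a byte stream into fixed-size blocks (the 'Lego pieces')."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, servers, replication=2):
    """Assign each block to `replication` different servers, round-robin.
    Real HDFS also uses rack-aware placement; that detail is omitted here."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [servers[(i + r) % len(servers)] for r in range(replication)]
    return placement

data = b"example data that would normally be gigabytes in size"
blocks = split_into_blocks(data, block_size=16)
print(place_blocks(blocks, ["server1", "server2", "server3"], replication=2))
```

Real HDFS uses much larger blocks (on the order of 64–128 MB) and typically a higher replication factor; the sketch only shows the splitting and duplication idea that makes parallel processing and disaster recovery possible.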

When to use Hadoop

Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized. Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears as available memory goes down to zero with really large datasets (Gu & Li, 2013).  Also, Hadoop fails at providing a real-time response (Greer, Rodriguez-Martinez, & Seguel, 2010).  Therefore, Hadoop should be used for big data that isn’t streaming in real time and that has a ton of iterative processing/analytical tasks.
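Encoded as a rule of thumb (a simplification of the guidance above, not a formal benchmark), the choice might look like this; the function name and thresholds are judgment calls, not anything defined by Gu and Li.

```python
def choose_engine(iterative: bool, needs_real_time: bool, fits_in_memory: bool) -> str:
    """Rule-of-thumb engine choice based on Gu and Li (2013) and
    Greer et al. (2010) as summarized above."""
    if needs_real_time:
        return "neither: Hadoop fails at real-time response; use a streaming engine"
    if iterative and fits_in_memory:
        return "Spark (faster for iterative jobs whose working set fits in memory)"
    return "Hadoop MapReduce (batch jobs, or data too large for available memory)"

print(choose_engine(iterative=True, needs_real_time=False, fits_in_memory=True))
```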

Preparation of Big Data for Hadoop

Collecting the raw and unaltered real-world data is usually the first step of any data or text mining study (Corrales et al., 2015; Gera & Goel, 2015; He et al., 2013; Hoonlor, 2011; Nassirtoussi et al., 2014). Next, the data must be preprocessed, because raw text data files are unsuitable for predictive data analytics tools like Hadoop (Hoonlor, 2011). Barak and Modarres (2015) and Nassirtoussi et al. (2014) stated that in both data and text mining, data preprocessing has the most significant impact on the research results.  Wayner (2013) and Lublinsky, Smith, and Yakubovich (2013) enumerated the following tools, used to preprocess data prior to data analysis with Hadoop, as part of the core components of the ecosystem:

  • Ambari: Graphical User Interface for setting up clusters with common components. Essentially a simple management tool.
  • Avro: A serialization system that compiles all the data together into an XML or JSON output to be shared with others.
  • BigTop: tool that provides testing of sub-projects within Hadoop.
  • Clouds: Allow the end-user to spin up multiple nodes to process the data without necessarily owning the infrastructure; essentially a pay-as-you-go model.
  • Flume: Gathers all data and places it into HDFS. Essentially an enterprise data integration tool.
  • GIS tools: allows end-users to work with big data stored as geographic maps under GIS (Geographic Information Systems) formats.
  • HBase: helps search and share a big tabular data set; unfortunately, full ACID is not available. Essentially a NoSQL Database.
  • HDFS: Storage of big data in multiple distributed systems into data blocks. Essentially a Distributed reliable data storage.
  • Hive: SQL type language that files and pulls out data that is needed from HBase. Essentially a high-level abstraction tool.
  • Lucene: indexes large blocks of unstructured text-based data and allows for dynamic clustering and the ability to read XML.
  • Mahout: Allows Hadoop to use classification, filtering, k-means, Dirichlet, parallel pattern, and Bayesian classification, similar to Hadoop’s MapReduce. Essentially a data analytics library.
  • NoSQL: Uses NoSQL data stores for data that is not typically stored in HBase or HDFS.
  • Oozie: manages the workflow of a job by allowing the user to break the job into simple steps in a flowchart fashion. Essentially a workflow manager.
  • Pig: stores and maps data in processing nodes for Hadoop to find and process. Essentially a high-level abstraction tool.
  • Spark: uses Hadoop infrastructure to store data in the cache to allow for faster processing time
  • SQL on Hadoop: ad-hoc query the data stored in Hadoop servers using SQL
  • Sqoop: stores data in SQL databases into Hadoop. Essentially an enterprise data integration tool.
  • Whirr: Library that allows Hadoop clusters to run on Amazon EC2, Rackspace, etc.
  • ZooKeeper: maintains order and synchronization throughout the parallel processing cluster. Essentially a coordinator of processes.

According to Lublinsky et al. (2013), there are always new datasets, data formats, and data preprocessing and processing tools being added to Hadoop.  Thus, the list provided above is not a comprehensive list, but rather one to begin from.
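As a small example of how preprocessed text data typically meets these tools, the pair of scripts below follows the Hadoop Streaming convention: the mapper and reducer read lines from stdin and write tab-separated key/value pairs to stdout, and the framework sorts the mapper output by key before the reducer sees it. The file names and the word-count task are illustrative only.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper: read raw text lines from stdin and
# emit one "word<TAB>1" pair per word on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word.lower()}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so counts can be accumulated one word at a time.
import sys

current_word, count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")
        count = 0
    current_word = word
    count += int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")
```

These scripts would be handed to Hadoop's streaming jar as the mapper and reducer; outside of Hadoop, they can be tested locally with a shell pipe such as `cat input.txt | ./mapper.py | sort | ./reducer.py`.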

Reference

  • Barak, S., & Modarres, M. (2015). Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications, 42(3), 1325–1339. http://doi.org/10.1016/j.eswa.2014.09.026
  • Corrales, D. C., Ledezma, A., & Corrales, J. C. (2015). A Conceptual Framework for Data Quality in Knowledge Discovery Tasks (FDQ-KDT): A Proposal. Journal of Computers, 10(6), 396-405. doi:10.17706/jcp.10.6.396-405
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Greer, M., Rodriguez-Martinez, M., & Seguel, J. (2010). Open Source Cloud Computing Tools: A Case Study with a Weather Application. Florida: IEEE Open Source Cloud Computing.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., & Heber, G. (2005). Scientific data management in the coming decade. ACM SIGMOD Record, 34(4), 34-41.
  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on Hadoop and Spark. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.
  • Eini, O. (2010). Map/Reduce- a visual explanation. Retrieved from https://ayende.com/blog/4435/map-reduce-a-visual-explanation
  • He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33, 464–472. http://doi.org/10.1016/j.ijinfomgt.2013.01.001
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Hortonworks (2013). Introduction to MapReduce. Retrieved from https://www.youtube.com/watch?v=ht3dNvdNDzI
  • IBM (n.d.) What is the Hadoop Distributed File System (HDFS)? Retrieved from https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
  • Lau, W. (2001). A Comprehensive Introduction to Cloud Computing. Retrieved from https://www.simple-talk.com/cloud/development/a-comprehensive-introduction-to-cloud-computing/
  • Lublinsky, B., Smith, K., Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Minelli, M., Chambers, M., Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses (1st). VitalSource Bookshelf Online.
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text mining for market prediction: A systematic review. Expert Systems with Applications, 41(16), 7653–7670. http://doi.org/10.1016/j.eswa.2014.06.009
  • Rathbone, M. (2013). A beginners guide to Hadoop. Retrieved from http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html
  • Sathupadi, K. (2010) Map Reduce: A really simple introduction. Retrieved from http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/

 

Big Data Analytics: Cloud Computing

Clouds come in three different privacy flavors: public (all customers and companies share the same resources), private (only one group of clients or one company can use particular cloud resources), and hybrid (some aspects of the cloud are public while others are private, depending on data sensitivity).

Cloud technology encompasses Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).  These types of cloud differ in what the company manages versus what the cloud provider manages.  For IaaS, the company manages the applications, data, runtime, and middleware, whereas the provider administers the O/S, virtualization, servers, storage, and networking.  For PaaS, the company manages the applications and data, whereas the vendor administers the runtime, middleware, O/S, virtualization, servers, storage, and networking.  Finally, for SaaS, the provider manages it all: applications, data, runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2011).  This differs from conventional data centers, where the company manages it all: applications, data, runtime, middleware, O/S, virtualization, servers, storage, and networking.

Examples of IaaS are Amazon Web Services, Rack Space, and VMware vCloud.  Examples of PaaS are Google App Engine, Windows Azure Platform, and force.com. Examples of SaaS are Gmail, Office 365, and Google Docs (Lau, 2011).

The benefits of the cloud come from its pay-as-you-go business model.  One, the company can pay for as much (SaaS) or as little (IaaS) of the service as they need, and for however much space they require. Two, the company can use an on-demand model, in which businesses scale up and down as they need (Dikaiakos, Katsaros, Mehra, Pallis, & Vakali, 2009).  For example, if a company would like a development environment for 3 weeks, they can build it up in the cloud for that time period and pay for 3 weeks of the service rather than buying a new set of infrastructure and setting up all the libraries.  Electing the cloud over buying new infrastructure can speed up development for a ton of applications moving forward.  These models are like renting a car: you rent what you need, and you pay for what you use (Lau, 2011).

Replacing Conventional Data Center?

Infrastructure costs are really high.  For a company to be spending that much money on something that will be outdated in 18 months (Moore’s law of technology), it’s just a constant money sink.  Outsourcing infrastructure is the first step of a company’s movement into the cloud.  However, companies need to understand the different privacy flavors well, because if data is stored in a public cloud, it will be hard to destroy the hardware: you would destroy not only your data, but other people’s and companies’ data as well.  Private clouds are best for government agencies, which may need or require physical destruction of the hardware.  Government agencies may even use hybrid structures, keeping private data in private clouds and public material in a public cloud.  Companies that contract with the government could migrate to hybrid clouds in the future, and businesses without government contracts could go onto a public cloud.  There may always be a need to store some data on a private server, like patents or KFC’s 7 herbs and spices recipe, but for the majority of the data, personally, the cloud may be a grand place to store and work from.

Note: Companies that do venture into moving onto a cloud platform and storing data there should focus on migrating data and data dictionaries slowly and with uniformity.  Data variables should have the same naming convention, one definition, a list of who is responsible for the data, meta-data, etc.  The migration to a new infrastructure would be a great chance for companies to clean up their data.

Resources: