Pros and Cons of Hadoop MapReduce

The are some of the advantages and disadvantages of using MapReduce are (Lusblinksy et al., 2014; Sakr, 2014):

Advantages 

  • Hadoop is ideal because it is a highly scalable platform that is cost-effective for many businesses.
  • It supports huge computations, particularly in parallel execution.
  • It isolates low-level applications such as fault-tolerance, scheduling, and data distribution.
  • It supports parallelism for program execution.
  • It allows easier fault tolerance.
  • Has a highly scalable redundant array of independent nodes
  • It has a cheap unreliable computer or commodity hardware.
  • Aggregation techniques under the mapper function can exploit multiple different techniques
  • No read or write of intermediate data, thus preserving the input data
  • No need to serialize or de-serialize code in either memory or processing
  • It is scalable based on the size of data and resources needed for processing the data
  • Isolation of the sequential program from data distribution, scheduling, and fault tolerance

Disadvantages 

  • The product is not ideal for real-time process data. During the map phase, the process creates too many keys, which consume sorting time. 
  • Most of the MapReduce outputs are merged.
  • MapReduce cannot use natural indices.
  • It is a must to buffer all the records for a particular join from the input relations in repartition join.
  • Users of the MapReduce framework use textual formats that are inefficient.
  • There is a huge waste of CPU resources, network bandwidth, and I/O since data must be reprocessed and loaded at every iteration.
  • The common framework of MapReduce doesn’t support applications designed for iterative data analysis.
  • When a fixed point is reached, detection may be the termination condition that calls for more MapReduce job that incurs overhead.
  • The framework of MapReduce doesn’t allow building one task from multiple data sets.
  • Too many mapper functions can create an infrastructure overhead, which increases resources and thus cost 
  • Too few mapper functions can create huge workloads for certain types of computational nodes
  • Too many reducers can provide too many outputs, and too few reducers can provide too few outputs
  • It’s a different programming paradigm that most programmers are not familiar with
  • The use of available parallelism will be underutilized for smaller data sets

Resources

  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.

Adv DBs: A possible future project?

Below is a possible future research paper on a database related subject.

Title: Using MapReduce to aid in clinical test utilization patterns in the medicine

The motivation:

Efficient processing and analysis of clinical data could aid in better clinical tests on patients, and MapReduce solutions allow for an integrated solution in the medical field, which aids in saving resources when it comes to moving data in and out of storage.

The problem statement (symptom and root cause)

The rates of Sexually Transmitted Infections (STIs) are increasing at alarming rates, could the addition of Roper Saint Francis Clinical Network in the South test utilization patterns into Hadoop with MapReduce reveal patterns in the current STIs population and predict areas where an outbreak may be imminent?

The hypothesis statement (propose a solution and address the root cause)

H0: Data mining in Hadoop with MapReduce will not be able to identify any meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

H1: Data mining in Hadoop with MapReduce can identify a meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

The research questions

Could this study apply to STIs outbreaks rates be generalized into other disease outbreak rates?

Is this application of data-mining in Hadoop with MapReduce the correct way to analyze the data?

The professional significance statement (new contribution to the body of knowledge)

Identifying where an outbreak of any disease (or STIs), via clinical tests utilization patterns has yet to be done according to Mohammed et al (2014), and they have stated that Hadoop with MapReduce is a great tool for clinical work because it has been adopted in similar fields of medicine like bioinformatics.

Resources

  • Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: Current landscape and future trends. Biodata Mining, 7. doi:http://dx.doi.org/10.1186/1756-0381-7-22 – Doctoral Library Advanced Technologies & Aerospace CollectionPokorny, J. (2011).
  • NoSQL databases: A step to database scalability in web environment. In iiWAS ’11 Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (pp. 278-283). – Doctoral Library ACM Digital Library

Parallel Programming: Synchronized Objects

Sanden (2011) shows how to use synchronized objects (concurrency in Java), which is a “safe” object, that are protected by locks in critical synchronized methods.  Through Java we can create threads by: (1) extend class Thread or (2) implement the interface Runnable.  The latter defines the code of a thread under a method: void run ( ), and the thread completes its execution when it reaches the end of the method (which is essentially a subroutine in FORTRAN).  Using the former you need the contractors public Thread ( ) and public Thread (Runnable runObject) along with methods like public start ( ).

Additional Examples:

MapReduce

According to Hortonworks (2013), MapReduce’s Process in a high level is: Input -> Map -> Shuffle and Sort -> Reduce -> Output.

Tasks:  Mappers, create and process transactions on a data set filed away in a distributed system and places the wanted data on a map/aggregate with a certain key.  Reducers will know what the key values are, and will take all the values stored in a similar map but in different nodes on a cluster (per the distributed system) from the mapper to reduce the amount of data that is relevant (Hortonworks, 2013). Reducers can work on different keys.

Example: A great example of this a MapReduce: Request, is to look at all CTU graduate students and sum up their current outstanding school loans per degree level.  Thus, the final output from our example would be:

  • Doctoral Students Current Outstanding School Loan Amount
  • Master Students Current Outstanding School Loan Amount.

Now let’s assume that this ran in Hadoop, which can do MapReduce.   Also, let’s assume that I could use 50 nodes (threads) to process this transaction request.  The bad data that gets thrown out in the mapper phase would be the Undergraduate Students, given that it does not match the initial search criteria.  The safe data will be those that are associated with Doctoral and Masters Students.  So, during the mapping phase, the threads will assign Doctoral Students to one key, and Master students would get another key.  Each node (thread) will use the same keys for their respective students, thus the keys are similar in all nodes (threads).  The reducer uses these keys and the safe objects in them, to sum up, all of the current outstanding school loan amounts get processed under the correct group.  Thus, once all nodes (threads) use the reducer part, we will have our two amounts:

  • Doctoral Students Current Outstanding School Loan
  • Masters Students Current Outstanding School Loan

Complexity could be added if we only wanted to look into graduate students that are currently active and non-active service members.  Or they could be complicated by gender, profession, diversity signifiers, we can even map to the current industry.

Resources

Adv Topics: MapReduce and Incremental Computation

Data usually gets update on a regular basis. Connolly and Begg (2014) defined that data can be updated incrementally, only small sections of the data, or can be updated completely. An example of data that can be updated incrementally are webpages, computer codes, stale data, data-at-rest, bodies of knowledge, etc. Whereas, some examples of data that can be updated completely are: weather data, space weather data, social media data, data-in-motion, dynamic data, etc. Both sets of data provide their own unique challenges when it comes to data processing. On average, analyzing web data, new to old data can range from 10-1000x (Sakr, 2014). Thus, the focus of this discussion is on incremental data update and how to process data in between two data processing runs.

Incoop is an extension of Hadoop to allow for processing incremental changes on big data, by splitting the main computation to its sub-computation, logging in data updates in a memoization server, while checking the inputs of the input data to each sub-computation (Bhatotia et al., 2011; Sakr, 2014). These sub-computations are usually mappers and reducers (Sakr, 2014). Incremental mappers check against the memoization servers, and if the data has already been processed and unchanged it will not reprocess the data, and a similar process for incremental reducers that check for changed mapper outputs (Bhatotia et al., 2011).

Subsequently, MapReduce is an analytical engine and pattern that takes advantage of distributed systems while keeping the processes and data in one machine (Sadalage & Fowler, 2012). There are a few key principles to using the MapReduce framework and Hadoop efficiently to improve incremental computation:

  • Data partitioning: The MapReduce framework aids in partitioning the data into similar size sets into Hadoop Distributed File System, aka HDFS (Lublinsky, Smith, & Yakubovich, 2013). Thus, MapReduce can support smaller sets of data stored in HDFS. This is part of the scalability of the cluster.
  • Fault tolerance and durability: Given that data can be partitioned to tiny chunks across thousands of computations nodes and run in parallel sometimes these nodes can fail (Sakr, 2014). The MapReduce framework replicates the data in the background and can launch backup jobs if a node fails (Lublinsky et al., 2013; Sakr, 2014). Thus, failure doesn’t disrupt the data processing. However it does increase the number of processors needed (Connolly & Begg, 2014).
  • Parallelization: The partitioned input data are considered as independent sets of data, such that the mapper functions can process the data in a parallel environment (Lublinsky et al., 2013; Sakr, 2014). This principle allows for the sub-functions within the mapper and reducer function to handle smaller data. It allows a data analyst to focus on the main problem rather than low-level parallel coding abstraction, like multithreading, file allocation, memory management, etc. (Sakr, 2014). Serialization does not allow for small incremental updates for large data (Connolly & Begg, 2014).
  • Data reuse: There is no need to read or write of intermediate data, thus preserving the input data to enable the data to be reused because it is unchanged (Lublinsky et al., 2013; Sakr, 2014).
  • Self-Adjusting Computation: Used for incremental computation, which only allows mappers and reducers only work on the smaller size sets of data that are impacted by the change (Sakr, 2014).

Both Bhatotia et al. (2011) and Sakr (2014), suggested an Inc-HDFS which is also an extension of the HDFS, for partitioning data based on content and removal of data duplication. There is the limitation of this approach where the number of files may be grouped in too many content bins or too little content bins and thus may not be evenly be spaced out (Bhatotia et al., 2011). Thus, invoking: too many mapper functions can create an infrastructure overhead, which increases resources and thus cost, or too few mapper functions can create huge workloads for certain types of computational nodes, or too many reducers can provide too many outputs, and too little reducers can provide too little outputs (Lublinsky et al., 2013; Sakr, 2014). A constraint must be added on both ends of the spectrum to allow for evenly distributed data sets (Bhatotia et al., 2011).

Subsequently, the Hadoop out-of-the-box product scheduler doesn’t account for memoization server data, therefore is not built for incremental analysis. Thus, Incoop has a memoization-aware scheduler, that schedules the sub-computations based on the affinity of the task and allows for efficient use of previously (Bhatotia et al., 2011; Sakr, 2014). The scheduler can run tasks on computational nodes that are either faster or locally to where the data is stored (Bhatotia et al., 2011). Using this type of scheduler, the scheduler should place priority on whether it is faster to conduct a data movement and run the task at a faster computational node or reduce data movement, and processes the data locally, while still making effective use of unchanged and processed data.

In the end, a practical application of this technique when it comes to analyzing web data would be to first partition the web data by its content. Scientific content can go in one context partition, corporate financial content into another context partition, etc. under the Inc-HDFS framework. These partitions are capped in size to allow for proper load balance. Incoop will then run the MapReduce function to process the data distributively through using parallel processes. If the data gets updated, like a new corporate update to the SEC 10K data, it will be recorded by the memoization server. This will allow for Incoop to be able to process that incremental change from the corporate financial content partition because it was using the memoization-aware scheduler, and reprocess the data through mapping and reducing function on just this small and partitioned dataset. Therefore, making effective use of unchanged and processed data.

Resources:

  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st ed.). Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.ok

Adv Topics: MapReduce and Iterative computation

Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, which also encompasses the data transmission between systems, while data that is stored on a database system or cloud system is considered as data-at-rest and data that is being processed and analyzed is considered as data-in-use (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Social media data or social network analysis data can be considered as data-in-motion and processing that type of data can be quite problematic. Data-in-motion has to be iteratively processed until there is a certain termination condition is reached and it can be reached between iterations (Sakr, 2014). Data-at-rest is probably considered easier to analyze; however, this type of data can also be problematic. If the data-at-rest is large in size and even if the data does not change or evolve, its large size requires iterative processes to analyze the data.

An example of an iterative process as suggested by Lusblinsky, Smith, and Yakubovich (2014), is solving a linear equation by approximation algorithms on hundreds of equations and variables. The data can be stored in matrix Ax = b, where A is a matrix of coefficients, b is a vector of output values, and x is a vector of variables. If the data is too large, using a simple linear algebraic solution would be impossible, so a quadratic spline solution would consist of the following:

f(x) = ½ xTAx-xT­b

and a superscript “T” represents transposing the vector or matrix. Each iteration of this spline would result in a better vector solution. However, Sakr (2014) stated that MapReduce does not support iterative data processing and analysis directly. Thus workaround is needed to handle iterative programs for situations like data-in-motion or even streaming data.

 Root causes and technical steps to address them

To deal with datasets that require iterative processes to analyze the data, computer coders need to create and arrange multiple MapReduce functions in a loop (Sakr, 2014). This workaround would increase the processing time of the serialized program because data would have to be reloaded and reprocessed, because there is no read or write of intermediate data, which was there for preserving the input data (Lusblinksy et al., 2014; Sakr, 2014). The root cause exists because in its simplest form MapReduce can consist of many mappers and one reducer, which is a performance bottleneck. This simplified model of the MapReduce analytical engine means one cannot reduce across keys, just one key at a time (Sadalage & Fowler, 2012). Thus, the algorithm is not built for iterations. HaLoop is an iteration solution built on top of the Hadoop infrastructure that has a loop control module, task scheduler, which caches invariant data from a previous iteration and uses it in future iterations (Sakr, 2014). This solution can be applied to static and unchanged data.

There are also disadvantages of using MapReduce on these types of data because too many mapper functions can create an infrastructure overhead or too many reducers can provide too many outputs (Lusblinksy et al., 2014; Sakr, 2014). Thus there has to be a basic implementation plan of effective data placement over the Hadoop cluster, to ensure proper load balance of data across the Hadoop servers (Sakr, 2014). Sometimes, there is a need to run two separate MapReduce functions, one to prepare the data by evenly distributing data across the servers and one that iteratively goes through the data (Lusblinksy et al., 2014). CoHadoop allows for data files and related files to be stored on the same server, provide a means of load balancing and fault tolerance by creating a file-level locator property (Sakr, 2014). Sakr describes the locator property as a means to keep track of where the data is stored via a unique identification number for each file in the system. This solution can also be applied to static and unchanged data.

For data-in-motion and streaming data, data has to be iteratively processed until there is a certain termination condition is reached. MapReduce Online has an approach where between the mapper and reducer function data is pipelined to allow for data processing and analysis as soon as the mappers produce their outputs (Sakr, 2014). This can run the MapReduce functions iteratively and provide relatively live reduced data outputs. Sakr, further explains that this approach is where the reducer contacts every mapper upon initiation of the scheduler and the data is temporarily stored on the pipeline, which is an in-memory buffer.

Resources:

  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. InContemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Kishore, N. & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT-BVICAM’s Internal Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
  • Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st ed.). Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.

Adv Topics: MapReduce and Hadoop

Hadoop allows for data processing through MapReduce and it also allows for data storage (Lublinsky et al., 2014). MapReduce is an analytical engine and pattern that takes advantage of distributed systems while keeping the processes and data in one machine (Sadalage & Fowler, 2012). MapReduce thus contains two functions that work in parallel on distributed systems (Hortonworks, 2013; Sadalage & Fowler, 2012; Sakr, 2014; Sathupadi, 2010):

    1. Mappers functions create and process transactions on the system by mapping and aggregating data by key values. Mappers can read only one data record at a time.
    2. Reducers functions know what that key values are and will take all those values stored in a map to reduce the data to what is relevant. Reducers help summarize the data into a single output. This helps deal with the amount of data moving between multiple computational nodes.

Lublinsky, Smith, and Yakubovich, (2014), stated that an intermediate component of MapReduce is known as the shuffle and sort, where the data from the mapping function outputs are moved and presented to the reducer function.

Thus, MapReduce is a framework that uses parallel sequential algorithms that capitalize on cloud architecture, which became popular under the open source Hadoop project, as its main executable analytic engine (Lublinsky et al., 2014; Sadalage & Fowler, 2012; Sakr, 2014). Essentially, a sequential algorithm is a computer program that runs on a sequence of commands, and a parallel algorithm runs a set of sequential commands over separate computational cores (Brookshear & Brylow, 2014; Sakr, 2014). Thus, a parallel sequential algorithm runs a full sequential program over multiple but separate cores (Sakr, 2014). Another feature of MapReduce is that a reduced output can become another’s map function (Sadalage & Fowler, 2012). Subsequently, the advantages and disadvantages of using MapReduce are (Lusblinksy et al., 2014; Sakr, 2014):

+ aggregation techniques under the mapper function can exploit multiple different techniques

+ no read or write of intermediate data, thus preserving the input data

+ no need to serialize or de-serialize code in either memory or processing

+ it is scalable based on the size of data and resources needed for processing the data

+ isolation of the sequential program from data distribution, scheduling, and fault tolerance

– too many mapper functions can create an infrastructure overhead, which increases resources and thus cost

– too few mapper functions can create huge workloads for certain types of computational nodes

– too many reducers can provide too many outputs, and too little reducers can provide too little outputs

 – it’s a different programming paradigm that most programmers are not familiar with

 – the use of available parallelism will be underutilized for smaller data sets

Given that Hadoop is predominately known for popularizing MapReduce tasks, it is also known for its Hadoop Distributed File System (HDFS) where the data is distributed across multiple systems (Rathbone, 2013). Hadoop’s service is part of the cloud (as Platform as a Service = PaaS).  For PaaS, the end users manage the applications and data, whereas the provider (Hadoop), administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001). Data is broken up into small blocks, like Legos, such that they are distributed across a distributed database system and across multiple servers and can be processed across all these servers, e.g. Hadoop Cluster (IBM, n.d.).

A common example of a parallel sequential program is dynamical weather forecasting models. In dynamical weather forecasting models, there is a set of defined geodynamic, thermodynamic, and physical sequential algorithms define and evolve the main seven variables of weathers across time. For each time step, the forecasting models run these sequential algorithms over each grid point, which can represent a finite geospatial region. Each of these geospatial regions is split amongst multiple computational scores. This example expands in complexity when data has to travel between different finite geospatial regions through the boundaries, which is an example of data parallelism (Sakr, 2014). MapReduce uses the concept of data parallelism to help map and reduce data. Therefore, weather models could be considered as a loose form of MapReduce algorithm.

Resources:

Data Tools: Hadoop Vs Spark

 

Apache Spark

Apache Spark started from a working group inside and outside of UC Berkley, in search of an open-sourced, multi-pass algorithm batch processing model of MapReduce (Zaharia et al., 2012). Spark can have applications written in Java, Scala, Python, R, and interfaces with SQL, which increases ease of use (Spark, n.d.; Zaharia et al., 2012).

Essentially, Spark is a high-performance computing cluster framework, but it doesn’t have its distributed file system and thus uses Hadoop Distributed File System (HDFS, HBase) as in input and output (Gu & Li, 2013).  Not only can it access data from HDFS, HBase, it can also access data from Cassandra, Hive, Tachyon, and any other Hadoop data source (Spark, n.d.).  However, Spark uses its data structure called Resilient Distribution Datasets (RDD) which cache’s data and is a read-only operation to improve its processing time as long as there is enough memory for it in all the nodes of a cluster (Gu & Li, 2013; Zaharia et al., 2012). Spark tries to avoid data reloading from the disk that is why it stores its data in the node’s cache system, for initial and intermediate results (Gu & Li, 2013).

Machines in the cluster can be rebuilt if lost, thus making the RDDs are fault-tolerant without requiring replication (Gu &LI, 2013; Zaharia et al., 2012).  Each RDD is tracked in a lineage graph, and reruns the operations if data becomes lost, therefore reconstructing data, even if all the nodes running spark were to fail (Zaharia et al., 2012).

Hadoop

Hadoop is Java-based system that allows for manipulation and calculations to be done by calling on MapReduce function on its HDFS system (Hortonworks, 2013; IBM, n.d.).

HFDS big data is broken up into smaller blocks across different locations, no matter the type or amount of data, each of these blocs can be still located, which can be aggregated like a set of Legos throughout a distributed database system (IBM, n.d.; Minelli, Chambers, & Dhiraj, 2013). Data blocks are distributed across multiple servers.  This block system provides an easy way to scale up or down the data needs of the company and allows for MapReduce to do it tasks on the smaller sets of the data for faster processing (IBM, n.d). IBM (n.d.) boasts that the data blocks in the HFDS are small enough that they can be easily duplicated (for disaster recovery purposes) in two different servers (or more, depending on your data needs), offering fault tolerance as well. Therefore, IBM’s (n.d.) MapReduce functions use the HFDS to run its procedures on the server in which the data is stored, where data is stored in a memory, not in cache and allow for continuous service.

MapReduce contains two job types that work in parallel on distributed systems: (1) Mappers which creates & processes transactions on the system by mapping/aggregating data by key values, and (2) Reducers which know what that key value is, will take all those values stored in a map and reduce the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Reducers can work on different keys, and when huge amounts of data are entered into MapReduce, then the Mapper maps the data, where the data is then shuffled and sorted before it is reduced (Hortonworks, 2013).  Once the data is reduced, the researcher gets the output that they sought.

Significant Differences between Hadoop and Apache Spark              

Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets, 3x-5x for relatively large datasets, but Spark is more memory intensive, and speed advantage disappears when available memory goes down to zero with really large datasets (Gu & Li, 2013).  Apache Spark, on their website, boasts that they can run programs 100X faster than Hadoop’s MapReduce in Memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk memory (Spark, n.d.).

Gu and Li (2013), recommend that if speed to the solution is not an issue, but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative Spark should be prioritized.

References

  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on hadoop and spark. InHigh Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.

Data Tools: Hadoop and how to install it

What is Hadoop

Hadoop’s Distributed File System (HFDS) is where big data is broken up into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers.  This block system provides an easy way to scale up or down the data needs of the company and allows for MapReduce to do it tasks on the smaller sets of the data for faster processing (IBM, n.d). Blocks are small enough that they can be easily duplicated (for disaster recovery purposes) in two different servers (or more, depending on the data needs).

HFDS can support many different data types, even those that are unknown or yet to be classified and it can store a bunch of data.  Thus, Hadoop’s technology to manage big data allows for parallel processing, which can allow for parallel searching, metadata management, parallel analysis (with MapReduce), the establishment of workflow system analysis, etc. (Gary et al., 2005, Hortonworks, 2013, & IBM, n.d.).

Given the massive amounts of data in Big Data that needs to get processed, manipulated, and calculated upon, parallel processing and programming are there to use the benefits of distributed systems to get the job done (Minelli et al., 2013).  Hadoop, which is Java based allows for manipulation and calculations to be done by calling on MapReduce, which pulls on the data which is distributed on its servers, to map key items/objects, and reduces the data to the query at hand (Hortonworks, 2013 & Sathupadi, 2010).

Parallel processing allows making quick work on a big data set, because rather than having one processor doing all the work, Hadoop splits up the task amongst many processors. This is the largest benefit of Hadoop, which allows for parallel processing.  Another advantage of parallel processing is when one processor/node goes out; another node can pick up from where that task last saved safe object task (which can slow down the calculation but by just a bit).  Hadoop knows that this happens all the time with their nodes, so the processor/node create backups of their data as part of their fail safe (IBM, n.d).  This is done so that another processor/node can continue its work on the copied data, which enhances data availability, which in the end gets the task you need to be done now.

Minelli et al. (2013) stated that traditional relational database systems could depend on hardware architecture.  However, Hadoop’s service is part of cloud (as Platform as a Service = PaaS).  For PaaS, we manage the applications, and data, whereas the provider (Hadoop), administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).  The next section discusses how to install Hadoop and how to set up Eclipse to access map/reduce servers.

Installation steps

  • Go to the Hadoop Main Page < http://hadoop.apache.org/ > and scroll down to the getting started section, and click “Download Hadoop from the release page.” (Birajdar, 2015)
  • In the Apache Hadoop Releases < http://hadoop.apache.org/releases.html > Select the link for the “source” code for Hadoop 2.7.3, and then select the first mirror: “http://apache.mirrors.ionfish.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3-src.tar.gz” (Birajdar, 2015)
  • Open the Hadoop-2.7.3 tarball file with a compression file reader like WinRAR archiver < http://www.rarlab.com/download.htm >. Then drag the file into the Local Disk (C:). (Birajdar, 2015)
  • Once the file has been completely transferred to the Local Disk drive, close the tarball file, and open up the hadoop-2.7.3-src folder. (Birajdar, 2015)
  • Download Hadoop 0.18.0 tarball file < https://archive.apache.org/dist/hadoop/core/hadoop-0.18.0/ > and place the copy the “Hadoop-vm-appliance-0-18-0” folder into the Java “jdk1.8.0_101” folder. (Birajdar, 2015; Gnsaheb, 2013)
  • Download Hadoop VM file < http://ydn.zenfs.com/site/hadoop/hadoop-vm-appliance-0-18-0_v1.zip >, unzip it and place it inside the Hadoop src file. (Birajdar, 2015)
  • Open up VMware Workstation 12, and open a virtual machine “Hadoop-appliance-0.18.0.vmx” and select play virtual machine. (Birajdar, 2015)
  • Login: Hadoop-user and password: Hadoop. (Birajdar, 2015; Gnsaheb, 2013)
  • Once in the virtual machine, type “./start-hadoop” and hit enter. (Birajdar, 2015; Gnsaheb, 2013)
    1. To test MapReduce on the VM: bin/Hadoop jar Hadoop-0.18.0-examples.jar pi 10 100000000
      1. You should get a “job finished in X seconds.”
      2. You should get an “estimated value of PI is Y.”
  • To bind MapReduce plugin to eclipse (Gnsaheb, 2013)
    1. Go into the JDK folder, under Hadoop-0.18.0 > contrib> eclipse-plugin > “Hadoop-0.18.0-eclipse-plugin” and place it into the eclipse neon 1 plugin folder “eclipse\plugins”
    2. Open eclipse, then open perspective button> other> map/reduce.
    3. In Eclipse, click on Windows> Show View > other > MapReduce Tools > Map/Reduce location
    4. Adding a server. On the Map/Reduce Location window, click on the elephant
      1. Location name: your choice
      2. Map/Reduce master host: IP address achieved after you log in via the VM
  • Map/Reduce Master Port: 9001
  1. DFS Master Port: 9000
  2. Username: Hadoop-user
  1. Go to the advance parameter tab > mapred.system.dir > edit to /Hadoop/mapped/system

Issues experienced in the installation processes (Discussion of any challenges and explain how it was investigated and solved)

Not one source has the entire solution Birajdar, 2015; Gnsaheb, 2013; Korolev, 2008).  It took a combination of all three sources, to get the same output that each of them has described.  Once the solution was determined to be correct, and the correct versions of the files were located, they were expressed in the instruction set above.  Whenever a person runs into a problem with computer science, google.com is their friend.  The links above will become outdated with time, and methods will change.  Each person’s computer system is different than those from my personal computer system, which is reflected in this instruction manual.  This instruction manual should help others google the right terms and in the right order to get Hadoop installed correctly onto their system.  This process takes about 3-5 hours to install correctly, with the long time it takes to download and install the right files, and with the time to set up everything correctly.

Resources

Big Data Analytics: Compelling Topics

Big Data and Hadoop:

According to Gray et al. (2005), traditional data management relies on arrays and tables in order to analyze objects, which can range from financial data, galaxies, proteins, events, spectra data, 2D weather, etc., but when it comes to N-dimensional arrays there is an “impedance mismatch” between the data and the database.    Big data, can be N-dimensional, which can also vary across time, i.e. text data (Gray et al., 2005). Big data, by its name, is voluminous. Thus, given the massive amounts of data in Big Data that needs to get processed, manipulated, and calculated upon, parallel processing and programming are there to use the benefits of distributed systems to get the job done (Minelli, Chambers, & Dhiraj, 2013).  Parallel processing allows making quick work on a big data set, because rather than having one processor doing all the work, you split up the task amongst many processors.

Hadoop’s Distributed File System (HFDS), breaks up big data into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers. Hadoop is Java-based and pulls on the data that is stored on their distributed servers, to map key items/objects, and reduces the data to the query at hand (MapReduce function). Hadoop is built to deal with big data stored in the cloud.

Cloud Computing:

Clouds come in three different privacy flavors: Public (all customers and companies share the all same resources), Private (only one group of clients or company can use a particular cloud resources), and Hybrid (some aspects of the cloud are public while others are private depending on the data sensitivity.  Cloud technology encompasses Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).  These types of cloud differ in what the company managers on what is managed by the cloud provider (Lau, 2011).  Cloud differs from the conventional data centers where the company managed it all: application, data, O/S, virtualization, servers, storage, and networking.  Cloud is replacing the conventional data center because infrastructure costs are high.  For a company to be spending that much money on a conventional data center that will get outdated in 18 months (Moore’s law of technology), it’s just a constant sink in money.  Thus, outsourcing the data center infrastructure is the first step of company’s movement into the cloud.

Key Components to Success:

You need to have the buy-in of the leaders and employees when it comes to using big data analytics for predictive, prescriptive or descriptive purposes.  When it came to buy-in, Lt. Palmer had to nurture top-down support as well as buy-in from the bottom-up (ranks).  It was much harder to get buy-in from more experienced detectives, who feel that the introduction of tools like analytics, is a way to tell them to give up their long-standing practices and even replace them.  So, Lt. Palmer had sold Blue PALMS as “What’s worked best for us is proving [the value of Blue PALMS] one case at a time, and stressing that it’s a tool, that it’s a compliment to their skills and experience, not a substitute”.  Lt. Palmer got buy-in from a senior and well-respected officer, by helping him solve a case.  The senior officer had a suspect in mind, and after feeding in the data, the tool was able to predict 20 people that could have done it in an order of most likely.  The suspect was on the top five, and when apprehended, the suspect confessed.  Doing, this case by case has built the trust amongst veteran officers and thus eventually got their buy in.

Applications of Big Data Analytics:

A result of Big Data Analytics is online profiling.  Online profiling is using a person’s online identity to collect information about them, their behaviors, their interactions, their tastes, etc. to drive a targeted advertising (McNurlin et al., 2008).  Profiling has its roots in third party cookies and profiling has now evolved to include 40 different variables that are collected from the consumer (Pophal, 2014).  Online profiling allows for marketers to send personalized and “perfect” advertisements to the consumer, instantly.

Moving from online profiling to studying social media, He, Zha, and Li (2013) stated their theory, that with higher positive customer engagement, customers can become brand advocates, which increases their brand loyalty and push referrals to their friends, and approximately 1/3 people followed a friend’s referral if done through social media. This insight came through analyzing the social media data from Pizza Hut, Dominos and Papa Johns, as they aim to control more of the market share to increase their revenue.  But, is this aiding in protecting people’s privacy when we analyze their social media content when they interact with a company?

HIPAA described how we should conduct de-identification of 18 identifiers/variables that would help protect people from ethical issues that could arise from big data.   HIPAA legislation is not standardized for all big data applications/cases; it is good practice. However, HIPAA legislation is mostly concerned with the health care industry, listing those 18 identifiers that have to be de-identified: Names, Geographic data, Dates, Telephone Numbers, VIN, Fax, Device ID and serial numbers, emails addresses, URLs, SSN, IP address, Medical Record Numbers, Biometric ID (fingerprints, iris scans, voice prints, etc), full face photos, health plan beneficiary numbers, account numbers, any other unique ID number (characteristic, codes, etc), and certifications/license numbers (HHS, n.d.).  We must be aware that HIPAA compliance is more a feature of the data collector and data owner than the cloud provider.

HIPAA arose from the human genome project 25 years ago, where they were trying to sequence its first 3B base pair of the human genome over a 13 year period (Green, Watson, & Collins, 2015).  This 3B base pair is about 100 GB uncompressed and by 2011, 13 quadrillion bases were sequenced (O’Driscoll et al., 2013). Studying genomic data comes with a whole host of ethical issues.  Some of those were addressed by the HIPPA legislation while other issues are left unresolved today.

One of the ethical issues that arose were mentioned in McEwen et al. (2013), for people who have submitted their genomic data 25 years ago can that data be used today in other studies? What about if it was used to help the participants of 25 years ago to take preventative measures for adverse health conditions?  However, ethical issues extend beyond privacy and compliance.  McEwen et al. (2013) warn that data has been collected for 25 years, and what if data from 20 years ago provides data that a participant can suffer an adverse health condition that could be preventable.  What is the duty of the researchers today to that participant?

Resources: