Pros and Cons of Hadoop MapReduce

Some of the advantages and disadvantages of using MapReduce are (Lublinsky et al., 2013; Sakr, 2014):


Advantages:

  • Hadoop is ideal because it is a highly scalable platform that is cost-effective for many businesses.
  • It supports huge computations, particularly in parallel execution.
  • It hides low-level concerns such as fault tolerance, scheduling, and data distribution from the application.
  • It supports parallelism for program execution.
  • It allows easier fault tolerance.
  • It has a highly scalable redundant array of independent nodes
  • It runs on cheap, unreliable computers, i.e., commodity hardware
  • Aggregation under the mapper function can exploit multiple different techniques
  • No read or write of intermediate data, thus preserving the input data
  • No need to serialize or de-serialize code in either memory or processing
  • It is scalable based on the size of data and resources needed for processing the data
  • Isolation of the sequential program from data distribution, scheduling, and fault tolerance


Disadvantages:

  • MapReduce is not ideal for processing real-time data. During the map phase, the process can create too many keys, which consumes sorting time.
  • Most of the MapReduce outputs are merged.
  • MapReduce cannot use natural indices.
  • In a repartition join, all the records for a particular join key from the input relations must be buffered.
  • Users of the MapReduce framework use textual formats that are inefficient.
  • There is a huge waste of CPU resources, network bandwidth, and I/O since data must be reprocessed and loaded at every iteration.
  • The common framework of MapReduce doesn’t support applications designed for iterative data analysis.
  • Detecting that a fixed point has been reached, which may be the termination condition, can call for an additional MapReduce job that incurs overhead.
  • The framework of MapReduce doesn’t allow building one task from multiple data sets.
  • Too many mapper functions can create an infrastructure overhead, which increases resource use and thus cost
  • Too few mapper functions can create huge workloads for certain types of computational nodes
  • Too many reducers can provide too many outputs, and too few reducers can provide too few outputs
  • It’s a different programming paradigm that most programmers are not familiar with
  • The use of available parallelism will be underutilized for smaller data sets
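
To make the mapper, key sorting, and reducer vocabulary in the lists above concrete, here is a minimal single-process sketch of the MapReduce word-count pattern in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample input are assumptions made for illustration, and a real cluster would run many mappers and reducers in parallel on separate nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle/sort: group values by key and sort the keys.

    In a real framework this step is done for you; it is also where
    "too many keys" turns into sorting time.
    """
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reducer: aggregate all values that share one key."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases sequentially over hypothetical input splits."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one input split.
splits = ["big data big clusters", "data moves to clusters"]
counts = word_count(splits)
print(counts)
```

The programmer writes only `map_phase` and `reduce_phase`; the grouping step in between is what the framework isolates from the sequential program.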


  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data (1st ed.). Vitalbook file.

To Outsource Social Media or Not

The advantages and disadvantages of outsourcing social media:

+ Outsourcing companies have the expertise to manage multiple social media platforms and understand which social media platforms are best for realizing a social business strategy (Craig, 2013; VerticalResponse, 2013).

+ Outsourcing companies can be a time saver, because it can take quite a while to learn about all the different types of social platforms out there, to grow an engaged audience, and to fully realize a social business strategy (VerticalResponse, 2013). Also, outsourcing companies are not bogged down by the other day-to-day tasks of the business requiring their services (Craig, 2013).

+ Outsourcing companies can save fiscal resources, given that accomplishing a social business strategy can take resources away from core business activities (VerticalResponse, 2013).

– An outsourcing company may not fully understand the business or the industry in which the business resides (Craig, 2013; Thomas, 2009).

– An outsourcing company is not part of the everyday marathon of the business (Thomas, 2009).

– An outsourcing company cannot be 100% authentic when responding to the customers because they are not the actual voice of the company (Craig, 2013; Thomas, 2009; VerticalResponse, 2013).

Two social media components that are more likely to be outsourced:

  • Setting up the multiple social media platform profiles, due to the tedious task of filling out the same standard fields/details in each account (Baroncini-Moe, 2010). Social media platforms like LinkedIn, Twitter, Facebook, etc. all have standard fields like a short bio, name, photo, username selection, user/corporate identity verification, etc. This non-value-added work can consume valuable time, yet it does not compromise a corporation’s authentic voice.
  • Automation of some status updates across some or all social media platforms (Baroncini-Moe, 2010). For instance, a post made on a blogging website could also be forwarded to a LinkedIn and a Twitter profile, but not to a Facebook profile. This differentiation should be explicitly stated in the social business strategy.


Data Tools: Use of XML

XML advantages

+ You can write your own markup language and are not limited to the tags defined by other people (UK Web Design Company, n.d.)

+ You can create your own tags at your own pace rather than waiting for a standards body to approve the tag structure (UK Web Design Company, n.d.)

+ It allows a specific industry or person to design and create their own set of tags that meets their unique problem, context, and needs (Brewton, Yuan, & Akowuah, 2012; UK Web Design Company, n.d.)

+ It is both a human-readable and a machine-readable format (Hiroshi, 2007)

+ It is used for data storage and processing both online and offline (Hiroshi, 2007)

+ It is platform independent, with forward and backward compatibility (Brewton et al., 2012; Hiroshi, 2007)

XML disadvantages

– Searching for information in the data is tough and time-consuming without a computer processing application (UK Web Design Company, n.d.)

– Like HTML, the data is tied to its markup logic and language, but there is no ready-made browser for simply exploring the data; therefore, HTML or other software may be required to process it (Brewton et al., 2012; UK Web Design Company, n.d.)

– Syntax and tags are redundant, which can consume huge amounts of bytes, and slow down processing speeds (Hiroshi, 2007)

– Limited to relational models and object-oriented graphs (Hiroshi, 2007)

– Tags are chosen by their creator. Thus, there is no standard set of tags that should be used (Brewton et al., 2012)
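
As a small illustration of the points above (self-defined tags, and the need for a processing application to search the data), the sketch below parses a made-up XML record with Python's standard `xml.etree.ElementTree` module. The tag names (`patient`, `allergy`, `vitalSign`) and the values are hypothetical; they are not taken from HL7 or any other standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical record using self-defined tags; no standards body approved these.
record_xml = """
<patient id="p-001">
  <name>Jane Doe</name>
  <allergy>penicillin</allergy>
  <allergy>latex</allergy>
  <vitalSign type="heartRate" unit="bpm">72</vitalSign>
</patient>
""".strip()

root = ET.fromstring(record_xml)

# Searching the data requires a parser: collect every <allergy> child element.
allergies = [node.text for node in root.findall("allergy")]

# Attribute predicates narrow the search to one specific vital sign.
heart_rate = int(root.find("vitalSign[@type='heartRate']").text)

print(root.get("id"), allergies, heart_rate)
```

The repeated opening and closing tags around each short value also show the redundancy noted in the disadvantages.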

XML use in Healthcare Industry

Thanks to the American National Standards Institute, Health Level 7 (HL7) was created with standards for health care XML, which is now in use by 90% of all large hospitals (Brewton et al., 2012; Institute of Medicine, 2004). The Institute of Medicine (2004) stated that health care data could consist of allergies, immunizations, social histories, histories, vital signs, physical examinations, physicians’ and nurses’ notes, laboratory tests, diagnostic tests, radiology tests, diagnoses, medications, procedures, clinical documentation, clinical measures for specific clinical conditions, patient instructions, dispositions, health maintenance schedules, etc. More complex datasets, like images, sounds, and other types of multimedia, are yet to be included (Brewton et al., 2012). Also, terminologies within the data elements do not follow a systematized nomenclature, and the standard does not support web protocols for more advanced communication of health data (Institute of Medicine, 2004). HL7 V3 should resolve many of these issues and should also account for a wide variety of health care scenarios (Brewton et al., 2012).

XML use in Astronomy

The Flexible Image Transport System (FITS), currently used by NASA/Goddard Space Flight Center, holds images, spectra, tables, and sky atlas data, and has been in use for 30 years (NASA, 2016; Pence et al., 2010). The newest version has a definition of time coordinates, support for long string keywords, multiple keywords, checksum keywords, and image and table compression standards (NASA, 2016). Support for mandatory keywords existed previously (Pence et al., 2010). Besides the differences in data entities, and therefore in the tags needed to describe the data, between the XML for healthcare and astronomy, the use of XML for a much longer period has allowed for a more robust solution that has evolved with technology. It is also widely used, as it is endorsed by the International Astronomical Union (NASA, 2016; Pence et al., 2010). Given the maturity of FITS, which was created in the late 1970s and remains a heavily endorsed standard in use today, the healthcare industry could learn something from this system. The only problem with FITS is that its heavy standardization and standards body remove some of the benefits of XML, including the flexibility to create your own tags.


Traditional Forecasting Vs. Scenario Planning

Traditional Forecasting

Traditional forecasting essentially extrapolates where you were and where you are now into the future, and the end of this extrapolated line is “the most likely scenario” (Wade, 2012; Wade, 2014). Mathematical formulations and extrapolations are the mechanical basis for traditional forecasting (Wade, 2012). At some point, these forecasts add ±5-10% to their projections and call the results “the best and worst case scenarios” (Wade, 2012; Wade, 2014). This ± difference is a narrow range of possibilities out of an actual 360° spherical solution space (Wade, 2014). There are both mathematical and mental forms of extrapolation, and both are quite dangerous because they assume that the world doesn’t change much (Wade, 2012). However, disruptions like new political situations, new management ideas, new economic situations, new regulations, new technological developments, a new competitor, new customer behavior, new societal attitudes, and new geopolitical tensions could move this forecast in either direction, such that it is no longer accurate (Wade, 2014). We shouldn’t just forecast the future via extrapolation; we should start to anticipate it through scenario analysis (Wade, 2012).
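
The mechanics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the least-squares line is one common way to extrapolate, the 10% margin is picked from the ±5-10% range mentioned above, and the function names and sales figures are hypothetical.

```python
def linear_extrapolate(history, steps_ahead):
    """Fit a least-squares line to evenly spaced history and extend it."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # The last observed point is x = n - 1; project steps_ahead beyond it.
    return intercept + slope * (n - 1 + steps_ahead)


def forecast_band(history, steps_ahead, margin=0.10):
    """Return the three outcomes of a traditional forecast."""
    likely = linear_extrapolate(history, steps_ahead)
    return {
        "worst_case": likely * (1 - margin),
        "most_likely": likely,
        "best_case": likely * (1 + margin),
    }


# Hypothetical yearly sales; the trend is assumed to continue unchanged.
sales = [100.0, 110.0, 120.0, 130.0]
band = forecast_band(sales, steps_ahead=2)
print(band)
```

Nothing in this calculation can represent a disruption: every outcome lies on or near the same straight line, which is exactly the weakness scenario analysis tries to address.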

Advantages (Wade, 2012; Wade, 2014):

+ Simple to personally understand, only three outcomes, with one that is “the most likely scenario.”

+ Simple for management to understand and move forward on

Disadvantages (Wade, 2012; Wade, 2014):

– Considered persistence forecasting, which is the least accurate in the long term

– Fails to take into account disruptions that may impact the scenario that is being analyzed

– Leads to a false sense of security that could be fatal in some situations

– A rigid technique that doesn’t allow for flexibility.

Scenario Planning

Scenario planning can be done with 9-30 participants (Wade, 2012). A key requirement of scenario planning is for everyone to understand that knowing the future is impossible, and yet people want to know where the future could go (Wade, 2014). It is important to note that scenarios are not predictions; scenarios only illuminate different ways the future may unfold (Wade, 2012).

Scenario planning is therefore an approach that is creative yet methodical, and that helps spell out some of the future scenarios that could happen. It has ten steps (Wade, 2012; Wade, 2014):

  • Framing the challenge
  • Gathering information
  • Identifying driving forces
  • Defining the future’s critical “either/or” uncertainties
  • Generating the scenarios
  • Fleshing them out and creating story lines
  • Validating the scenarios and identifying future research needs
  • Assessing their implications and defining possible responses
  • Identifying signposts
  • Monitoring and updating the scenarios as time goes on

However, in a talk, Wade (2014) distilled his ten-step process to cover the core steps in scenario planning:

  • Hold a brainstorming session to identify as many as possible of the driving forces or trends that could have an impact on the problem at hand. Think of any trend or force (direct, indirect, or very indirect) that would have any effect, in any way and of any magnitude, on the problem. These could fall under the following categories:
    • Political
    • Economic
    • Environmental
    • Societal
    • Technological
  • Next, the group must identify the critical uncertainties in the future from this overwhelming list. There are three types of forces:
    • Some forces have a very low impact but very high uncertainty; these are called secondary elements.
    • Some forces have a very high impact but low uncertainty; these are called predetermined elements.
    • Some forces have a very high impact and high uncertainty; these are called critical uncertainties.
  • Subsequently, select the top two most critical uncertainties and model the most extreme cases of each outcome, as in “either … or …”. The extremes must be contrasting landscapes. Place one critical uncertainty’s either/or on one axis, and the other’s on the other axis.
  • Finally, the group should describe the resulting scenarios. What key challenges and key issues would be faced in each of these four scenarios? What should the responses look like? What opportunities would arise? This helps the group plan strategically and find a way to potentially innovate in this landscape and outthink its competitors (Wade, 2014).
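
The sorting and two-axis steps above can be sketched in code. Everything here is a hypothetical illustration rather than Wade's own procedure: the 0-to-1 impact and uncertainty scores, the 0.5 threshold, the ranking by impact times uncertainty, and the example forces are all assumptions. For simplicity, the sketch also treats every low-impact force as a secondary element.

```python
from itertools import product


def classify_force(impact, uncertainty, threshold=0.5):
    """Sort a driving force into one of the three types (assumed 0-1 scores)."""
    if impact < threshold:
        return "secondary element"
    if uncertainty < threshold:
        return "predetermined element"
    return "critical uncertainty"


def scenario_matrix(forces):
    """Cross the extremes of the top two critical uncertainties (2 x 2)."""
    critical = [
        name for name, force in forces.items()
        if classify_force(force["impact"], force["uncertainty"]) == "critical uncertainty"
    ]
    if len(critical) < 2:
        raise ValueError("need at least two critical uncertainties")
    # Assumed ranking rule: highest impact * uncertainty first.
    critical.sort(
        key=lambda name: forces[name]["impact"] * forces[name]["uncertainty"],
        reverse=True,
    )
    first, second = critical[:2]
    return [
        {first: a, second: b}
        for a, b in product(forces[first]["extremes"], forces[second]["extremes"])
    ]


# Hypothetical output of a brainstorming session.
forces = {
    "regulation": {"impact": 0.9, "uncertainty": 0.8, "extremes": ("strict", "lax")},
    "customer adoption": {"impact": 0.7, "uncertainty": 0.9, "extremes": ("fast", "slow")},
    "aging workforce": {"impact": 0.9, "uncertainty": 0.1, "extremes": ("older", "younger")},
    "office decor trends": {"impact": 0.1, "uncertainty": 0.9, "extremes": ("bold", "plain")},
}

scenarios = scenario_matrix(forces)
for scenario in scenarios:
    print(scenario)
```

Each of the four dictionaries is one landscape for the group to flesh out with challenges, responses, and opportunities.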

Advantages (Frum, 2013; Wade, 2012; Wade, 2014):

+ Focuses on the top two most critical uncertainties to drive simplicity

+ Helps define the extremes in the four different Landscapes and their unique Challenges, Responses, and Opportunities to innovate to create a portfolio of future scenarios

+ An analytical planning method helping to discover the Strengths, Weaknesses, Opportunities, and Threats affecting each scenario

+ Helps you focus on the players in each landscape: competitors, customers, suppliers, employees, key stakeholders, etc.

Disadvantages (Wade, 2012; Wade, 2014):

– No one has a crystal ball

– More time consuming than traditional forecasting

– Only focuses on two of the most critical uncertainties; in the real world, more critical uncertainties may need to be analyzed.


Adv Quant: Decision Trees

Decision Trees

Humans, when facing a decision, tend to seek out a path, solution, or option that appears closest to the goal (Brookshear & Brylow, 2014). Decision trees are helpful as they are predictive models (Ahlemeyer-Stubbe & Coleman, 2014). Thus, decision trees aid in data abstraction and in finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014), and they aid the decision-making process by mapping out all the paths, solutions, or options available for the decision maker to decide upon. Every decision is different and varies in complexity. Therefore, there is no single way to write a simple and well-thought-out decision tree (Sadalage & Fowler, 2012).

Ahlemeyer-Stubbe and Coleman (2014) stated that decision trees are a great way to identify possible variables for inclusion in statistical models that are mutually exclusive and collectively exhaustive, even if the relationship between the target and the inputs is weak. To help facilitate decision making, each node on a decision tree can have a question attached to it that needs to be asked, with the leaves associated with each node representing the differing answers (McNurlin, Sprague, & Bui, 2008). The variable with the strongest influence becomes the topmost branch of the decision tree (Ahlemeyer-Stubbe & Coleman, 2014). Chaudhuri, Lo, Loh, and Yang (1995) define regression decision trees as those where the target question/variable is continuous, real, or logistic. Murthy (1998) confirms this definition for regression decision trees, and defines classification decision trees as those where the target question/variable needs to be split into different, finite, and discrete classes.

Aiming to mirror the way the human brain works, classification decision trees can be created by using neural network algorithms, which contain a connection of nodes that can have multiple inputs, outputs, and processes in each node (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Neural network algorithms contrast with typical decision trees, which usually have one input, one output, and one process per node (similar to Figure 1). Once a root question has been identified, the decision tree algorithm keeps recursively iterating through the data, with the aim of answering the root question (Ahlemeyer-Stubbe & Coleman, 2014).

However, the larger the decision tree, the weaker the leaves get, because the model tends to overfit the data. Thresholds can therefore be applied to the decision tree modeling algorithm to prune back the unstable leaves (Ahlemeyer-Stubbe & Coleman, 2014). When looking for a decision tree algorithm to parse through data, it is best to find one that has pruning capabilities.
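
The recursive splitting and the thresholding idea can be sketched with a tiny classification tree in plain Python. This is an illustrative sketch, not the algorithm from any of the cited texts: it assumes categorical features, uses Gini impurity to choose questions, and uses a hypothetical `min_leaf` parameter as a stopping threshold (a simple stand-in for pruning, since it prevents small leaves from being grown rather than cutting them back afterwards). The umbrella data is made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure set of labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, min_leaf):
    """Return the (feature, value) question with the lowest weighted Gini."""
    best, best_score = None, gini(labels)
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}):
            yes = [lab for row, lab in zip(rows, labels) if row[feature] == value]
            no = [lab for row, lab in zip(rows, labels) if row[feature] != value]
            if len(yes) < min_leaf or len(no) < min_leaf:
                continue  # threshold: skip questions that create tiny leaves
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if score < best_score:
                best, best_score = (feature, value), score
    return best


def build_tree(rows, labels, min_leaf=1):
    """Recursively split until no question improves purity."""
    split = best_split(rows, labels, min_leaf)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    feature, value = split
    yes = [(r, lab) for r, lab in zip(rows, labels) if r[feature] == value]
    no = [(r, lab) for r, lab in zip(rows, labels) if r[feature] != value]
    return {
        "question": split,
        "yes": build_tree([r for r, _ in yes], [lab for _, lab in yes], min_leaf),
        "no": build_tree([r for r, _ in no], [lab for _, lab in no], min_leaf),
    }


def predict(tree, row):
    """Walk from the root question down to a leaf."""
    while isinstance(tree, dict):
        feature, value = tree["question"]
        tree = tree["yes"] if row[feature] == value else tree["no"]
    return tree


# Hypothetical observations: did the person take an umbrella?
rows = [
    {"forecast": "rain", "windy": "no"},
    {"forecast": "rain", "windy": "yes"},
    {"forecast": "sun", "windy": "no"},
    {"forecast": "sun", "windy": "yes"},
    {"forecast": "cloudy", "windy": "no"},
    {"forecast": "cloudy", "windy": "yes"},
]
labels = ["umbrella", "umbrella", "none", "none", "none", "umbrella"]

full_tree = build_tree(rows, labels, min_leaf=1)
small_tree = build_tree(rows, labels, min_leaf=2)
print(full_tree)
print(small_tree)
```

With `min_leaf=1` the tree keeps splitting until every leaf holds a single class, including a leaf built from one observation; with `min_leaf=2` that unstable branch is never created.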


Figure 1: A left-to-right decision tree on whether or not to take an umbrella, assuming the person is going to spend any amount of time outside during the day.

Advantages of a decision tree

According to Ahlemeyer-Stubbe and Coleman (2014), some of the advantages of using decision trees are:

+ Few assumptions are needed about the distribution of the data

+ Few assumptions are needed about the linearity

+ Decision trees are not sensitive to outliers

+ Decision trees are best for large data, because of their adaptability and minimal assumptions needed to begin parsing the data

+ For logistic and linear regression trees, parameter estimation and hypothesis testing are possible

+ For neural network (Classification) decision trees, predictive equations can be derived

According to Murthy (1998) the advantages of using classification decision trees are:

+ Pre-classified examples mitigate the need for subject matter expert knowledge

+ It is an exploratory method as opposed to an inferential method

According to Chaudhuri et al. (1995) the advantages of using a regression decision tree are:

+ It can easily handle model complexity in an easily interpretable way

+ Covariate values are conveyed by the tree structure

+ Statistical properties can be derived and studied


  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
  • Chaudhuri, P., Lo, W. D., Loh, W. Y., & Yang, C. C. (1995). Generalized regression trees. Statistica Sinica, 641-666.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management, 8th Edition. [VitalSource Bookshelf Online].
  • Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4), 345-389.
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [VitalSource Bookshelf Online].

Adv Quant: Bayesian Analysis

Uncertainty in making decisions

Generalizing from something specific is, from a statistical standpoint, the problem of induction, and it can cause uncertainty in making decisions (Spiegelhalter & Rice, 2009). Uncertainty in making a decision could also arise from not knowing how to incorporate new data with old assumptions (Hubbard, 2010).

According to Hubbard (2010) conventional statistics assumes:

(1)    The researcher has no prior information about the range of possible values (which is never true) or,

(2)    The researcher does have prior knowledge about the distribution of the population, and it is never any of the messy ones (which is not true more often than not)

Thus, the knowledge held before data collection and the knowledge gained from data collection don’t tell the full story until they are combined, hence the need for Bayesian analysis (Hubbard, 2010). Bayes’ theorem can be reduced to a conditional probability that aims to take into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003). Bayesian analysis avoids the overconfidence and underconfidence that come from ignoring prior data or ignoring new data (Hubbard, 2010), through the implementation of the equation below:

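$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{data} \mid \text{hypothesis})\, P(\text{hypothesis})}{P(\text{data})} \qquad (1)$$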

Where P(hypothesis|data) is the posterior probability, P(hypothesis) is the prior probability of the hypothesis/distribution before the data is introduced, P(data) is the marginal probability of the data, and P(data|hypothesis) is the likelihood, i.e., the probability of observing the data if the hypothesis/distribution is true (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003). This forces the researcher to think about the likelihood that different and new observations could impact a current hypothesis (Hubbard, 2010). Equation (1) shows that evidence is usually a result of two conditional probabilities, where the strongest evidence comes from a low probability that the new data could have led to X (Yudkowsky, 2003). From these two conditional probabilities, the resultant value is approximately the average of the prior assumptions and the new data gained (Hubbard, 2010; Smith, 2015). Smith (2015) describes this approximation in the following simplified relationship (equation 2):

$$\text{posterior} \propto \text{prior} \times \text{likelihood} \qquad (2)$$

Therefore, from equation (2), the type of prior assumption influences the resulting posterior. A Uniform, Constant, or Normal prior distribution results in a Normal posterior distribution, and a Beta or Binomial distribution results in a Beta posterior distribution (Smith, 2015). To use Bayesian analysis, one must take into account the analysis’ assumptions.
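
A short worked example may help show how the prior drives the posterior. The sketch below is an illustration under stated assumptions: `bayes_posterior` applies Bayes' rule as in equation (1), with P(data) expanded by the law of total probability over "hypothesis true" and "hypothesis false", and `beta_binomial_update` shows the standard Beta prior with Binomial data case. The function names and all numbers are hypothetical.

```python
def bayes_posterior(prior, likelihood, likelihood_if_false):
    """Return P(hypothesis | data) using Bayes' rule.

    prior               = P(hypothesis)
    likelihood          = P(data | hypothesis)
    likelihood_if_false = P(data | not hypothesis)
    """
    marginal = likelihood * prior + likelihood_if_false * (1.0 - prior)  # P(data)
    return likelihood * prior / marginal


def beta_binomial_update(alpha, beta, successes, failures):
    """Beta(alpha, beta) prior + Binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + failures


# Hypothetical numbers: a rare hypothesis and fairly strong evidence.
weak_prior = bayes_posterior(prior=0.01, likelihood=0.90, likelihood_if_false=0.05)
strong_prior = bayes_posterior(prior=0.50, likelihood=0.90, likelihood_if_false=0.05)
print(round(weak_prior, 3), round(strong_prior, 3))

# Hypothetical Beta(2, 2) prior updated with 7 successes and 3 failures.
a, b = beta_binomial_update(alpha=2, beta=2, successes=7, failures=3)
posterior_mean = a / (a + b)
print(a, b, round(posterior_mean, 3))
```

The same evidence yields very different posteriors for the two priors, which is the sensitivity to the prior that the Pros and Cons discussion returns to.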

Basic Assumptions of Bayesian Analysis

Though these three assumptions are great to have for Bayesian analysis, it has been argued that they are quite unrealistic when applied to real-life data, particularly unstructured text-based data (Lewis, 1998; Turhan & Bener, 2009):

  • Each of the new data samples is independent of the others and identically distributed (Lewis, 1998; Nigam & Ghani, 2000; Turhan & Bener, 2009)
  • Each attribute has equal importance (Turhan & Bener, 2009)
  • The new data is compatible with the target posterior (Nigam & Ghani, 2000; Smith, 2015).

Applications of Bayesian Analysis

There are typically three main situations where Bayesian analysis is used (Spiegelhalter & Rice, 2009):

  • Small data situations: The researcher has no choice but to include prior quantitative information, because of a lack of data, or lack of a distribution model.
  • Moderate size data situations: The researcher has multiple sources of data. They can create a hierarchical model on the assumption of similar prior distributions
  • Big data situations: There are huge joint probability models, with thousands of data points or parameters, which can then be used to help make inferences about unknown aspects of the data

Pros and Cons

Applying Bayesian analysis to data has its advantages and disadvantages. The advantages and disadvantages of Bayesian analysis, as identified by SAS (n.d.), are:


+    Allows for a combination of prior information with data, for stronger decision-making

+    No reliance on asymptotic approximation, thus the inferences are conditional on the data

+    Provides easily interpretable results.


– Posteriors are heavily influenced by their priors.

– This method doesn’t help the researcher to select the proper prior, given how much influence it has on the posterior.

– Computationally expensive with large data sets.

The key takeaway from this discussion is that prior knowledge can heavily influence the posterior, which can easily be seen in equation (2). That is because the knowledge held before data collection and the knowledge gained from data collection don’t tell the full story unless they are combined.


Quantitative Vs Qualitative Analysis

Field (2013) states that quantitative and qualitative methods are complementary, not competing, approaches to solving the world’s problems, although the methods are quite different from each other. Creswell (2014) explains how quantitative and qualitative methods can be combined to study a phenomenon through what is called a “Mixed Method” approach, which is out of scope for this discussion. Simply put, quantitative methods are utilized when the research contains variables that are numerical, and qualitative methods are utilized when the research contains variables that are based on language (Field, 2013). Thus, each method’s goals and procedures are quite different.

Goals and procedures

Quantitative methods derive from a positivist, numerically driven epistemology (Joyner, 2012). Quantitative methods use closed-ended questions, i.e., hypotheses, and collect their data numerically through instruments (Creswell, 2014). In quantitative research, there is an emphasis on experiments, measurement, and a search for relationships via fitting data to a statistical model and through observing a collection of data graphically to identify trends via deduction (Field, 2013; Joyner, 2012). According to Creswell (2014), quantitative researchers build protections against biases and control for alternative explanations through experiments that are generalizable and replicable. Quantitative studies could be experimental, quasi-experimental, causal-comparative, correlational, descriptive, or evaluative (Joyner, 2012). According to Edmondson and McManus (2007), quantitative methodologies fit best when the underlying research theory is mature. The maturity of the theory should tend to drive researchers towards one method over the other, along the spectrum of quantitative, mixed, or qualitative methodologies (Creswell, 2014; Edmondson & McManus, 2007).

Comparatively, Edmondson and McManus (2007) stated that qualitative methodologies fit best when the underlying research theory is nascent. Qualitative methods derive from a phenomenological view, i.e., the perceptions of people (Joyner, 2012). Qualitative methods use open-ended questions, i.e., interview questions, and collect their data through observations of a situation (Creswell, 2014). Qualitative research focuses on the meaning and understanding of a situation, where the researcher searches for meaning through interpretation of the data via induction (Creswell, 2014; Joyner, 2012). Qualitative research could be case studies, ethnographic, action, philosophical, historical, legal, educational, etc. (Joyner, 2012).

Commonalities and differences

The commonalities that exist between these two methods are that each method has a question to answer and an identified area of interest (Creswell, 2014; Edmondson & McManus, 2007; Field, 2013; Joyner, 2012). Each method requires a survey of the current literature to help develop the research question (Creswell, 2014; Edmondson & McManus, 2007). Finally, there is a need to design a study to collect and analyze data to help answer that research question (Creswell, 2014; Edmondson & McManus, 2007; Field, 2013; Joyner, 2012). Therefore, the similarities between these two methods lie in why research is conducted and, at a high level, in what research is conducted and how. They differ in the particulars of what research is conducted and how.

The research question(s) can become a centralized question with or without sub-questions, but quantitative research is driven by a series of statistically testable theoretical hypotheses (Creswell, 2014; Edmondson & McManus, 2007). For quantitative data analysis, statistical tests are done to seek relationships, with hopes of testing a theory-driven hypothesis and providing a precise model, via a collection of numerical measures and established constructs (Edmondson & McManus, 2007). Given the need to statistically accept or reject theoretical hypotheses, the sample sizes for quantitative methods tend to be greater than those for qualitative methods (Creswell, 2014). Qualitative research is driven by exploration and observations to test its hypothesis (Creswell, 2014; Edmondson & McManus, 2007). For qualitative data analysis, there should be an iterative and explorative content analysis, with hopes of building a new construct (Edmondson & McManus, 2007). These are some of the many differences that exist between these two methods.

When are the advantages of quantitative methods maximized?

According to Edmondson and McManus (2007), the best time to use quantitative methods is when the underlying theory of the research subject is mature. Maturity consists of extensive literature that could be reviewed, the existence of theoretical constructs, and extensively tested measures (Edmondson & McManus, 2007). Thus, the application of quantitative methods will help build effectively on prior work, which will help fill in the gap of knowledge on a particular topic, whereas qualitative methods and mixed methods would fail to do so. Applying qualitative methods to a mature theory is reinventing the wheel, and applying mixed methods to it results in uneven status of the evidence (Edmondson & McManus, 2007).


  • Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California: SAGE Publications, Inc. VitalBook file.
  • Edmondson, A. C., & McManus, S. E. (2007). Methodological fit in management field research. Academy of Management Review, 32(4), 1155–1179.
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Joyner, R. L. (2012). Writing the Winning Thesis or Dissertation: A Step-by-Step Guide (3rd ed.). Corwin. VitalBook file.