Pros and Cons of Hadoop MapReduce

Some of the advantages and disadvantages of using MapReduce are listed below (Lublinsky, Smith, & Yakubovich, 2013; Sakr, 2014); a short code sketch follows the two lists:

Advantages 

  • Hadoop is ideal because it is a highly scalable platform that is cost-effective for many businesses.
  • It supports huge computations, particularly in parallel execution.
  • It isolates the application from low-level concerns such as fault tolerance, scheduling, and data distribution.
  • It supports parallelism for program execution.
  • It allows easier fault tolerance.
  • Has a highly scalable redundant array of independent nodes
  • It runs on cheap, potentially unreliable commodity hardware.
  • Aggregation within the mapper function can exploit several different techniques
  • No read or write of intermediate data, thus preserving the input data
  • No need to serialize or de-serialize code in either memory or processing
  • It is scalable based on the size of data and resources needed for processing the data
  • Isolation of the sequential program from data distribution, scheduling, and fault tolerance

Disadvantages 

  • The product is not ideal for real-time data processing. During the map phase, the process creates too many keys, which consume sorting time.
  • Most of the MapReduce outputs are merged.
  • MapReduce cannot use natural indices.
  • In a repartition join, all the records for a particular join key from the input relations must be buffered.
  • Users of the MapReduce framework use textual formats that are inefficient.
  • There is a huge waste of CPU resources, network bandwidth, and I/O since data must be reprocessed and loaded at every iteration.
  • The common framework of MapReduce doesn’t support applications designed for iterative data analysis.
  • Detecting whether a fixed point has been reached (a common termination condition for iterative algorithms) may require an additional MapReduce job, which incurs overhead.
  • The framework of MapReduce doesn’t allow building one task from multiple data sets.
  • Too many mapper functions can create infrastructure overhead, which increases resource use and thus cost
  • Too few mapper functions can create huge workloads for certain types of computational nodes
  • Too many reducers can provide too many outputs, and too few reducers can provide too few outputs
  • It’s a different programming paradigm that most programmers are not familiar with
  • The available parallelism is underutilized for smaller data sets
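
To make the mapper/reducer split above concrete, below is a minimal in-memory sketch of the classic MapReduce word-count pattern in Python. It is illustrative only: the sample input is hypothetical, and a real Hadoop job would distribute the map, shuffle/sort, and reduce phases across many nodes rather than run them in a single process.

```python
# Minimal in-memory sketch of the MapReduce word-count pattern.
# A real Hadoop job distributes these phases across many nodes;
# here they run sequentially in one process for illustration.
from collections import defaultdict

def mapper(line):
    """Map phase: emit (word, 1) for every word in a line of input."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: aggregate every count emitted for a single key."""
    return word, sum(counts)

def run_job(lines):
    # Shuffle/sort phase: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(word, counts) for word, counts in sorted(groups.items()))

if __name__ == "__main__":
    sample = ["hadoop supports parallel execution",
              "hadoop isolates fault tolerance and scheduling"]
    print(run_job(sample))
```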

Resources

  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. VitalBook file.
  • Sakr, S. (2014). Large Scale and Big Data (1st ed.). VitalBook file.

To Outsource Social Media or Not

The advantages (+) and disadvantages (–) of outsourcing social media:

+ Outsourcing companies have the expertise to manage multiple social media platforms and to understand which platforms are best suited to realizing a social business strategy (Craig, 2013; VerticalResponse, 2013).

+ Outsourcing companies can be a time saver, because it can take quite a while to learn about all the different types of social platforms out there, to grow an engaging audience, and to fully realize a social business strategy (VerticalResponse, 2013). It also means the business requiring their services is not bogged down with yet another day-to-day task (Craig, 2013).

+ Outsourcing companies can be a fiscal resource saver, given that accomplishing a social business strategy can take resources away from core business activities (VerticalResponse, 2013).

– An outsourcing company may not fully understand the business or the industry in which the business resides (Craig, 2013; Thomas, 2009).

– An outsourcing company is not part of the everyday marathon of the business (Thomas, 2009).

– An outsourcing company cannot be 100% authentic when responding to the customers because they are not the actual voice of the company (Craig, 2013; Thomas, 2009; VerticalResponse, 2013).

Two social media components that are more likely to be outsourced:

  • Setting up the multiple social media platform profiles, due to the tedious task of filling out the same standard fields/details in each account (Baroncini-Moe, 2010). Social media platforms like LinkedIn, Twitter, Facebook, etc. all have standard fields such as a short bio, name, photo, username selection, user/corporate identity verification, etc. This non-value-added work can consume valuable time, yet does not compromise the company’s authentic voice.
  • Automation of some status updates across some or all social media platforms (Baroncini-Moe, 2010). For instance, a post made on a blogging website could also be forwarded to a LinkedIn and Twitter profile, but not to Facebook. This differentiation should be explicitly stated in the social business strategy.

References

Data Tools: Use of XML

XML advantages

+ You can write your own markup language and are not limited to the tags defined by other people (UK Web Design Company, n.d.)

+ You can create your own tags at your own pace rather than waiting for a standards body to approve the tag structure (UK Web Design Company, n.d.)

+ Allows a specific industry or person to design and create their own set of tags that meets their unique problem, context, and needs (Brewton, Yuan, & Akowuah, 2012; UK Web Design Company, n.d.)

+ It is both a human- and machine-readable format (Hiroshi, 2007)

+ Used for data storage and processing both online and offline (Hiroshi, 2007)

+ Platform independent, with forward and backward compatibility (Brewton et al., 2012; Hiroshi, 2007)

XML disadvantages

– Searching for information in the data is tough and time-consuming without a computer processing application (UK Web Design Company, n.d.)

– Data is tied to logic and language in a way similar to HTML, but without a ready-made browser to simply explore the data; it may therefore require HTML or other software to process the data (Brewton et al., 2012; UK Web Design Company, n.d.)

– Syntax and tags are redundant, which can consume huge amounts of bytes, and slow down processing speeds (Hiroshi, 2007)

– Limited to relational models and object-oriented graphs (Hiroshi, 2007)

– Tags are chosen by their creator; thus, there is no standard set of tags that should be used (Brewton et al., 2012)

XML use in Healthcare Industry

Thanks to the American National Standards Institute, Health Level 7 (HL7) was created to provide standards for health care XML, and it is now in use by 90% of all large hospitals (Brewton et al., 2012; Institute of Medicine, 2004). The Institute of Medicine (2004) stated that health care data could consist of: allergies, immunizations, social histories, histories, vital signs, physical examinations, physician’s and nurse’s notes, laboratory tests, diagnostic tests, radiology tests, diagnoses, medications, procedures, clinical documentation, clinical measures for specific clinical conditions, patient instructions, dispositions, health maintenance schedules, etc. More complex data sets like images, sounds, and other types of multimedia are yet to be included (Brewton et al., 2012). Also, the terminologies within the data elements are not a systematized nomenclature, and the standard does not support web protocols for more advanced communication of health data (Institute of Medicine, 2004). HL7 V3 should resolve many of these issues and should also account for a wide variety of health care scenarios (Brewton et al., 2012).
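
As an illustration of how such industry-specific tags can be both human-readable and machine-processed, below is a short sketch that parses a hypothetical patient-record fragment with Python’s standard XML library. The element names are invented for this example and do not follow the actual HL7 schema.

```python
# Hypothetical patient-record fragment; the tags are illustrative only
# and are not actual HL7 markup.
import xml.etree.ElementTree as ET

record_xml = """
<patientRecord>
  <allergies>
    <allergy>penicillin</allergy>
  </allergies>
  <vitalSigns>
    <heartRate unit="bpm">72</heartRate>
    <temperature unit="F">98.6</temperature>
  </vitalSigns>
</patientRecord>
"""

record = ET.fromstring(record_xml)
for allergy in record.iter("allergy"):          # machine-readable lookup
    print("Allergy:", allergy.text)
heart_rate = record.find("./vitalSigns/heartRate")
print("Heart rate:", heart_rate.text, heart_rate.get("unit"))
```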

XML use in Astronomy

The Flexible Image Transport System (FITS), currently used by the NASA/Goddard Space Flight Center, holds images, spectra, tables, and sky atlas data and has been in use for 30 years (NASA, 2016; Pence et al., 2010). The newest version adds a definition of time coordinates, support for long string keywords, multiple keywords, checksum keywords, and image and table compression standards (NASA, 2016). Support for mandatory keywords existed previously (Pence et al., 2010). Besides the differences in data entities, and therefore in the tags needed to describe the data, between the XML used for health care and for astronomy, the use of XML over a much longer period has allowed for a more robust solution that has evolved with technology. It is also widely used, as it is endorsed by the International Astronomical Union (NASA, 2016; Pence et al., 2010). Based on the maturity of FITS, due to its creation in the late 1970s, and the fact that it is still in use, heavily endorsed, and remains a standard today, the health care industry could learn something from this system. The only problem with FITS is that it removes some of the benefits of XML, including the flexibility to create your own tags, due to the heavy standardization and the standardization body.

Resources

Traditional Forecasting Vs. Scenario Planning

Traditional Forecasting

Traditional forecasting is essentially extrapolating from where you were and where you are now into the future, and the end of this extrapolated line is called “the most likely scenario” (Wade, 2012; Wade, 2014). Mathematical formulations and extrapolations are the mechanical basis for traditional forecasting (Wade, 2012). At some point, these forecasts add ±5-10% to their projections and call that “the best and worst case scenario” (Wade, 2012; Wade, 2014). This ± difference is a narrow range of possibilities out of what is actually a full 360° spherical space of solutions (Wade, 2014). There are both mathematical and mental forms of extrapolation, and both are quite dangerous because they assume that the world doesn’t change much (Wade, 2012). However, disruptions like new political situations, new management ideas, new economic situations, new regulations, new technological developments, a new competitor, new customer behavior, new societal attitudes, and new geopolitical tensions could move this forecast in either direction, such that it is no longer accurate (Wade, 2014). We shouldn’t just forecast the future via extrapolation; we should start to anticipate it through scenario analysis (Wade, 2012).
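
As a minimal numerical sketch of the mechanical extrapolation described above, the snippet below fits a straight line to a few years of hypothetical data and adds the ±10% “best and worst case” band; the figures are invented purely to illustrate the mechanics that Wade criticizes.

```python
# Naive linear extrapolation with a +/-10% "best and worst case" band,
# the mechanical style of traditional forecasting described above.
import numpy as np

years = np.array([2012, 2013, 2014, 2015, 2016])
sales = np.array([100.0, 108.0, 115.0, 124.0, 131.0])   # hypothetical history

slope, intercept = np.polyfit(years, sales, deg=1)       # fit a straight line
future_years = np.array([2017, 2018, 2019])
most_likely = slope * future_years + intercept           # "most likely scenario"

for year, value in zip(future_years, most_likely):
    print(f"{year}: worst {value * 0.90:.1f}, "
          f"most likely {value:.1f}, best {value * 1.10:.1f}")
```

Note that nothing in this calculation can anticipate the disruptions listed above; the band is just an arithmetic margin around the extrapolated line.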

Advantages (Wade, 2012; Wade, 2014):

+ Simple to personally understand, only three outcomes, with one that is “the most likely scenario.”

+ Simple for management to understand and move forward on

Disadvantages (Wade, 2012; Wade, 2014):

– Considered persistence forecasting, which is the least accurate in the long term

– Fails to take into account disruptions that may impact the scenario that is being analyzed

– Leads to a false sense of security that could be fatal in some situations

– A rigid technique that doesn’t allow for flexibility.

Scenario Planning

Scenario planning can be done with 9-30 participants (Wade, 2012). A key requirement of scenario planning, though, is for everyone to understand that knowing the future is impossible, even though people want to know where the future could go (Wade, 2014). It is also important to note that scenarios are not predictions; scenarios only illuminate different ways the future may unfold (Wade, 2012)!

This tool therefore offers a creative yet methodical approach to spelling out some of the future scenarios that could happen, in ten steps (Wade, 2012; Wade, 2014):

  • Framing the challenge
  • Gathering information
  • Identifying driving forces
  • Defining the future’s critical “either/or” uncertainties
  • Generating the scenarios
  • Fleshing them out and creating story lines
  • Validating the scenarios and identifying future research needs
  • Assessing their implications and defining possible responses
  • Identifying signposts
  • Monitoring and updating the scenarios as time goes on

However, in a talk, Wade (2014) distilled his ten-step process down to the core steps of scenario planning:

  • Hold a brainstorming session to identify as many driving forces or trends as possible that could have an impact on the problem at hand. Think of any trend or force (direct, indirect, or very indirect) that could affect the problem in any way and at any magnitude; these forces typically fall under the following categories:
    • Political
    • Economic
    • Environmental
    • Societal
    • Technological
  • Next, the group must identify the critical uncertainties about the future from this overwhelming list. The forces fall into three types:
    • Some forces have a very low impact but vary in uncertainty; these are called secondary elements.
    • Some forces have a very high impact but low uncertainty; these are called predetermined elements.
    • Some forces have a very high impact and high uncertainty; these are called critical uncertainties.
  • Subsequently, select the top two critical uncertainties and model the most extreme cases of each outcome as an “either … or …”. They must form contrasting landscapes. Place one critical uncertainty’s either/or on one axis and the other’s on the other axis (a small sketch of the resulting 2×2 matrix follows this list).
  • Finally, the group should describe the four resulting scenarios. What key challenges and issues would be faced in each of these four different scenarios? What should the responses look like? What opportunities and challenges will arise? This helps the group to plan strategically and find ways to potentially innovate in each landscape, in order to outthink their competitors (Wade, 2014).
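
The following sketch enumerates the four contrasting landscapes produced by crossing two critical uncertainties on a 2×2 matrix. The uncertainties named here are hypothetical, chosen only to illustrate the mechanics of the step above.

```python
# Build the 2x2 scenario matrix from two "either ... or ..." critical
# uncertainties, one per axis, yielding four contrasting landscapes.
from itertools import product

# Hypothetical critical uncertainties, for illustration only.
uncertainty_x = ("strict regulation", "loose regulation")
uncertainty_y = ("rapid technological change", "slow technological change")

for number, (x_outcome, y_outcome) in enumerate(
        product(uncertainty_x, uncertainty_y), start=1):
    # For each landscape, the group would then list the key challenges,
    # responses, and opportunities to innovate.
    print(f"Scenario {number}: {x_outcome} + {y_outcome}")
```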

Advantages (Frum, 2013; Wade, 2012; Wade, 2014):

+ Focuses on the top two most critical uncertainties to drive simplicity

+ Helps define the extremes of the four different landscapes and their unique challenges, responses, and opportunities to innovate, creating a portfolio of future scenarios

+ An analytical planning method helping to discover the Strengths, Weaknesses, Opportunities, and Threats affecting each scenario

+ Helps you focus on the players in each landscape: competitors, customers, suppliers, employees, key stakeholders, etc.

Disadvantages (Wade, 2012; Wade, 2014):

– No one has a crystal ball

– More time consuming than traditional forecasting

– Focuses on only two of the most critical uncertainties; in the real world, there are more critical uncertainties that warrant analysis.

References

Adv Quant: Bayesian Analysis

Uncertainty in making decisions

Generalizing from something specific is, from a statistical standpoint, the problem of induction, and it can cause uncertainty in making decisions (Spiegelhalter & Rice, 2009). Uncertainty in making a decision could also arise from not knowing how to incorporate new data with old assumptions (Hubbard, 2010).

According to Hubbard (2010), conventional statistics assumes:

(1)    The researcher has no prior information about the range of possible values (which is never true), or

(2)    The researcher does have prior knowledge of the distribution of the population, and it is never one of the messy ones (which is not true more often than not)

Thus, the knowledge held before data collection and the knowledge gained from data collection don’t tell the full story until they are combined, hence the need for Bayesian analysis (Hubbard, 2010). Bayes’ theorem can be reduced to a conditional probability that takes prior knowledge into account but updates itself when new data become available (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003). Bayesian analysis avoids the overconfidence and underconfidence that come from ignoring prior data or ignoring new data (Hubbard, 2010), through the implementation of the equation below:

P(hypothesis | data) = [P(data | hypothesis) × P(hypothesis)] / P(data)                           (1)

Where P(hypothesis|data) is the posterior probability, P(hypothesis) is the probability of the hypothesis/distribution before the data is introduced (the prior), P(data) is the marginal probability of the data, and P(data|hypothesis) is the likelihood of observing the data given the hypothesis/distribution (Hubbard, 2010; Smith, 2015; Spiegelhalter & Rice, 2009; Yudkowsky, 2003). This forces the researcher to think about the likelihood that different and new observations could impact a current hypothesis (Hubbard, 2010). Equation (1) shows that evidence is usually the result of two conditional probabilities, where the strongest evidence comes from a low probability that the new data could have led to X (Yudkowsky, 2003). From these two conditional probabilities, the resulting value is approximately the average of the prior assumptions and the new data gained (Hubbard, 2010; Smith, 2015). Smith (2015) describes this approximation in the following simplified relationship (equation 2):

posterior ≈ average(prior, new data)                                            (2)

Therefore, from equation (2), the type of prior assumption influences the posterior result. A Uniform, Constant, or Normal prior distribution results in a Normal posterior distribution, while a Beta or Binomial prior results in a Beta posterior distribution (Smith, 2015). To use Bayesian analysis, one must take the analysis’ assumptions into account.
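
As a minimal sketch of the Beta/Binomial case noted above (the prior parameters and observed counts below are hypothetical), a Beta(a, b) prior updated with k successes in n trials yields a Beta(a + k, b + n − k) posterior, whose mean sits between the prior belief and the observed data rate:

```python
# Conjugate Beta-Binomial update: the posterior stays in the Beta family,
# and its mean lies between the prior mean and the observed success rate.
def beta_binomial_update(a_prior, b_prior, successes, trials):
    a_post = a_prior + successes
    b_post = b_prior + (trials - successes)
    return a_post, b_post

# Hypothetical numbers: a Beta(2, 2) prior (centered on 0.5),
# then 16 successes observed in 20 new trials.
a_post, b_post = beta_binomial_update(2, 2, successes=16, trials=20)

prior_mean = 2 / (2 + 2)                       # 0.50, belief before the data
data_rate = 16 / 20                            # 0.80, what the new data alone says
posterior_mean = a_post / (a_post + b_post)    # 0.75, pulled from prior toward data

print(prior_mean, data_rate, posterior_mean)
```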

Basic Assumptions of Bayesian Analysis

Though these three assumptions are great to have for Bayesian analysis, it has been argued that they are quite unrealistic when applied to real-life data, particularly unstructured text-based data (Lewis, 1998; Turhan & Bener, 2009):

  • Each of the new data samples is independent of each other and identically distributed (Lewis, 1998; Nigam & Ghani, 2000; Turhan & Bener, 2009)
  • Each attribute has equal importance (Turhan & Bener, 2009)
  • The new data is compatible with the target posterior (Nigam & Ghani, 2000; Smith, 2015).

Applications of Bayesian Analysis

There are typically three main situations where Bayesian analysis is used (Spiegelhalter & Rice, 2009):

  • Small data situations: The researcher has no choice but to include prior quantitative information, because of a lack of data, or lack of a distribution model.
  • Moderate-sized data situations: The researcher has multiple sources of data and can create a hierarchical model on the assumption of similar prior distributions.
  • Big data situations: There are huge joint probability models, with thousands of data points or parameters, which can then be used to help make inferences about unknown aspects of the data.

Pros and Cons

Applying Bayesian analysis to data has its advantages and disadvantages. Those advantages and disadvantages, as identified by SAS (n.d.), are:

Advantages

+    Allows prior information to be combined with the data, for stronger decision-making

+    No reliance on asymptotic approximation, thus the inferences are conditional on the data

+    Provides easily interpretable results.

Disadvantages

– Posteriors are heavily influenced by their priors.

– This method doesn’t help the researcher to select the proper prior, given how much influence it has on the posterior.

– Computationally expensive with large data sets.

The key takeaway from this discussion is that prior knowledge can heavily influence the posterior, which can easily be seen in equation (2). That is because the knowledge held before data collection and the knowledge gained from data collection don’t tell the full story unless they are combined.

Reference