Big Data Analytics: Installing R

I didn’t have any problems with the installation thanks to a video produced by Dr. Webb (2014).  It is a bigger package than I thought it would be, so it can take a few minutes to download, depending on your internet connection speed.  With that in mind:

(1)    For proper installation of R, you need to have administrative access on your computer.

(2)    Watch this video for step-by-step instructions and an online tutorial on installing R and its graphical Integrated Development Environment (IDE).

  1. Note: The 32-bit and 64-bit applications for R can be found at http://cran.r-project.org/
  2. Note: The free RStudio “Desktop” graphical IDE can be found at http://www.rstudio.com/

(3)    Once installed, use the manual for this application at this site: http://cran.r-project.org/doc/manuals/R-intro.html

Once I installed the software and the graphical IDE, I continued to follow along with the video, used the prepopulated cars data under the “datasets” package, and got the same result as shown in the video.  I would also like to note that Dr. Webb (2014) checked the packages “datasets,” “graphics,” “grDevices,” “methods,” and “stats” in the video, which can be hard to see depending on your video streaming resolution.
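
For anyone who wants to reproduce the cars example, a minimal sketch of the R console commands might look like the following (the exact commands used in the video may differ slightly):

    # Load the prepopulated cars dataset from the "datasets" package
    library(datasets)  # usually attached by default in R
    data(cars)
    head(cars)     # first six rows: speed (mph) and stopping distance (ft)
    summary(cars)  # basic descriptive statistics
    plot(cars)     # scatter plot of speed vs. stopping distance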

Resources:

Webb, J. (2014). Installing and Using the “R” Programming Language and RStudio. Retrieved from https://www.youtube.com/watch?v=77PgrZSHvws&feature=youtu.be

Big Data Analytics: Hadoop®

Hadoop® Distributed File System (HDFS):

In HDFS, big data is broken up into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos across a distributed database system. Data blocks are distributed across multiple servers.  This block system provides an easy way to scale a company’s data needs up or down, and it allows MapReduce to do its tasks on smaller sets of the data for faster processing (IBM, n.d.). Blocks are small enough that they can be easily duplicated (for disaster recovery purposes) across two or more servers, depending on your data needs.

Example 1:

An example of HDFS-stored data is to think of a deck of cards, where each card holds information about what it is: value, color, symbol, etc.  HDFS can divide the data into blocks by A, 2, 3 … J, Q, & K, so each block will hold data for about four cards.  Thus, there are 13 distinct data blocks, which have been parsed by their value and placed on 13 different servers.  Let’s also assume I need higher-than-average availability, so rather than two copies, I need four copies of the J, Q, & K values, and two copies of A, 2, 3 … 10.  This is possible.  The copies could be clustered on similar servers, or each could have a server of its own.  This type of redundancy in my data within HDFS has the benefit of higher availability.  Thus, when I need to analyze my deck of cards, the important values J, Q, & K have a higher chance of being available than my perceived lower-value cards A, 2 … 10.
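
To make the card analogy concrete, here is a small single-machine R sketch that splits a 52-card deck into 13 blocks by value and records a hypothetical replication factor per block (the replication counts mirror the example above; real HDFS block placement works differently):

    # Build a 52-card deck: 13 values x 4 suits
    values <- c("A", 2:10, "J", "Q", "K")
    suits  <- c("Hearts", "Diamonds", "Clubs", "Spades")
    deck   <- expand.grid(value = values, suit = suits)

    # "Block" the deck by value: 13 blocks of 4 cards each
    blocks <- split(deck, deck$value)
    length(blocks)        # 13 blocks
    nrow(blocks[["Q"]])   # 4 cards in the Q block

    # Hypothetical replication factors: 4 copies for J, Q, K; 2 for the rest
    replicas <- ifelse(names(blocks) %in% c("J", "Q", "K"), 4, 2)
    names(replicas) <- names(blocks)
    replicas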

MapReduce:

MapReduce contains two job types that work in parallel on distributed systems: (1) Mappers, which create and process transactions on the system by mapping/aggregating data by key values, and (2) Reducers, which know what that key value is and take all the values stored in a map, reducing the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Reducers can work on different keys.  Huge amounts of data are entered into MapReduce; the Mapper maps the data, and then the data is shuffled and sorted before it is reduced.  Once the data is reduced, we get the output that we sought.
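
The map, shuffle/sort, and reduce steps can be illustrated with a toy word-count in plain R; this is only a single-machine sketch of the logic, not actual Hadoop code:

    # Input: raw records
    records <- c("apple", "pear", "apple", "plum", "pear", "apple")

    # Map: emit a (key, value) pair of (word, 1) for each record
    mapped <- lapply(records, function(r) list(key = r, value = 1))

    # Shuffle/sort: group the emitted values by key
    keys    <- sapply(mapped, `[[`, "key")
    grouped <- split(sapply(mapped, `[[`, "value"), keys)

    # Reduce: sum the values for each key
    reduced <- sapply(grouped, sum)
    reduced   # apple = 3, pear = 2, plum = 1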

MapReduce functions using HDFS run their procedures on the server on which the data is stored, also known as data locality (IBM, n.d.).  Keeping in mind that HDFS keeps at least two backup copies, if one server goes down, which can happen, the task can continue on the same data on a different, working server.  This backup system for disaster recovery allows for high data availability.

Example 2:

Adapted from Sathupadi (2010), this example looks at how MapReduce can calculate the sum of all Harvard law students’ and medical students’ current outstanding school loans per degree type.  Thus, the final output from our example would be the current outstanding school loan amounts for Juris Doctor (JD) students, Legum Magister (LLM) students, Doctor of Medicine (MD) students, and Doctor of Osteopathic Medicine (DO) students.

If I ran this in Hadoop, a single copy of the data could be stored across 50 servers, and thus 50 nodes could be used to process this transaction request in parallel, speeding things up significantly, but not by 50-fold.  The reason it is not 50-fold is that it takes time to reduce the mapped data, and nodes need to talk to each other, which slows down the transaction.  So, running on X parallel nodes never really means we are X times faster; in reality, we are roughly (X - e) times faster (where e represents the communication and coordination cost).
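
As a back-of-the-envelope illustration of that overhead, with purely invented numbers:

    # Invented numbers for illustration only
    t_serial   <- 500                           # seconds of work on one node
    nodes      <- 50
    overhead   <- 60                            # seconds of shuffle/communication cost
    t_parallel <- t_serial / nodes + overhead   # 10 + 60 = 70 seconds
    t_serial / t_parallel                       # ~7.1x speedup, well short of 50x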

The bad data that gets thrown out in the mapper phase would be the undergraduate students, Doctor of Philosophy students, master’s degree students, etc.  Only JD, LLM, MD, and DO students get one key each assigned to them, keys that are consistent across all nodes, so that the sum of all current outstanding school loan amounts gets processed under the correct group.  Because data is duplicated at least twice on different servers, if a server were to go down, the MapReduce function can move on to a copy of that data, which can still be mapped and reduced.
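
A single-machine R sketch of this loans example (the student records and loan amounts below are made up for illustration) might look like the following:

    # Made-up input records: (degree, outstanding loan) pairs
    students <- data.frame(
      degree = c("JD", "MD", "PhD", "LLM", "JD", "BA", "DO", "MD"),
      loan   = c(90000, 150000, 40000, 60000, 85000, 30000, 120000, 140000)
    )

    # Map phase: keep only the four degree keys we care about;
    # everything else (PhD, BA, etc.) is the "bad data" that gets thrown out
    wanted <- c("JD", "LLM", "MD", "DO")
    mapped <- students[students$degree %in% wanted, ]

    # Shuffle/sort: group loan values by degree key
    grouped <- split(mapped$loan, mapped$degree)

    # Reduce phase: sum the outstanding loans per degree
    sapply(grouped, sum)
    #     DO     JD    LLM     MD
    # 120000 175000  60000 290000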

Resources:


Big Data Analytics: Cloud Computing

Clouds come in three different privacy flavors: Public (all customers and companies share the same resources), Private (only one group of clients or one company can use particular cloud resources), and Hybrid (some aspects of the cloud are public while others are private, depending on the data sensitivity).

Cloud technology encompasses Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).  These types of cloud differ in what the company manages versus what is managed by the cloud provider.  For IaaS, the company manages the applications, data, runtime, and middleware, whereas the provider administers the O/S, virtualization, servers, storage, and networking.  For PaaS, the company manages the applications and data, whereas the vendor administers the runtime, middleware, O/S, virtualization, servers, storage, and networking.  Finally, for SaaS, the provider manages it all: applications, data, runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2011).  This differs from a conventional data center, where the company manages all of those layers itself.
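
One way to summarize the Lau (2011) breakdown above is a small lookup table, sketched here as an R data frame:

    # Who manages each layer under each service model (per Lau, 2011)
    layers <- c("application", "data", "runtime", "middleware",
                "O/S", "virtualization", "servers", "storage", "networking")
    responsibility <- data.frame(
      layer       = layers,
      on_premises = "company",
      IaaS        = c(rep("company", 4), rep("provider", 5)),
      PaaS        = c(rep("company", 2), rep("provider", 7)),
      SaaS        = "provider"
    )
    responsibility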

Examples of IaaS are Amazon Web Services, Rackspace, and VMware vCloud.  Examples of PaaS are Google App Engine, Windows Azure Platform, and force.com. Examples of SaaS are Gmail, Office 365, and Google Docs (Lau, 2011).

One of the benefits of the cloud is its pay-as-you-go business model.  First, a company can pay for as much (SaaS) or as little (IaaS) of the service as it needs, and for only as much space as it requires. Second, the company can use an on-demand model, in which businesses scale up and down as they need (Dikaiakos, Katsaros, Mehra, Pallis, & Vakali, 2009).  For example, if a company would like a development environment for 3 weeks, it can build that environment in the cloud and pay for 3 weeks of service rather than buying a new set of infrastructure and setting up all the libraries.  Electing the cloud over buying new infrastructure can speed up development for many applications moving forward.  These models are like renting a car: you rent what you need, and you pay for what you use (Lau, 2011).
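
To make the rental analogy concrete, here is a quick cost comparison in R, with purely invented prices:

    # Invented prices for illustration only
    cloud_hourly  <- 0.50                         # hypothetical hourly rate
    hours_3weeks  <- 3 * 7 * 24                   # 504 hours
    cloud_cost    <- cloud_hourly * hours_3weeks  # $252 for the 3-week project
    hardware_cost <- 10000                        # hypothetical server purchase
    cloud_cost / hardware_cost                    # ~2.5% of buying outright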

Replacing Conventional Data Center?

Infrastructure costs are really high.  For a company to spend that much money on something that will be outdated in 18 months (Moore’s law), it is a constant money sink.  Outsourcing infrastructure is the first step of a company’s movement into the cloud.  However, companies need to understand the different privacy flavors well, because if data is stored in a public cloud, it will be hard to destroy the hardware: destroying it would destroy not only your data but other people’s and companies’ data as well.  Private clouds are best for government agencies, which may need or require physical destruction of the hardware.  Government agencies may even use hybrid structures, keeping private data in private clouds and public material in a public cloud.  Companies that contract with the government could migrate to hybrid clouds in the future, and businesses without government contracts could go onto a public cloud.  There may always be a need to store some data on a private server, like patents or KFC’s 11 herbs and spices recipe, but for the majority of data, the cloud may be a grand place to store and work from.

Note: Companies that do venture into moving to a cloud platform and storing data there should focus on migrating data and data dictionaries slowly and with uniformity.  Data variables should have the same naming convention, one definition, a list of who is responsible for the data, metadata, etc.  Migrating to a new infrastructure would be a great chance for companies to clean up their data.

Resources: