Big Data Analytics: Compelling Topics

Big Data and Hadoop:

According to Gray et al. (2005), traditional data management relies on arrays and tables in order to analyze objects, which can range from financial data, galaxies, proteins, events, spectra data, 2D weather, etc., but when it comes to N-dimensional arrays there is an “impedance mismatch” between the data and the database.    Big data, can be N-dimensional, which can also vary across time, i.e. text data (Gray et al., 2005). Big data, by its name, is voluminous. Thus, given the massive amounts of data in Big Data that needs to get processed, manipulated, and calculated upon, parallel processing and programming are there to use the benefits of distributed systems to get the job done (Minelli, Chambers, & Dhiraj, 2013).  Parallel processing allows making quick work on a big data set, because rather than having one processor doing all the work, you split up the task amongst many processors.

Hadoop’s Distributed File System (HFDS), breaks up big data into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers. Hadoop is Java-based and pulls on the data that is stored on their distributed servers, to map key items/objects, and reduces the data to the query at hand (MapReduce function). Hadoop is built to deal with big data stored in the cloud.

Cloud Computing:

Clouds come in three different privacy flavors: Public (all customers and companies share the all same resources), Private (only one group of clients or company can use a particular cloud resources), and Hybrid (some aspects of the cloud are public while others are private depending on the data sensitivity.  Cloud technology encompasses Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).  These types of cloud differ in what the company managers on what is managed by the cloud provider (Lau, 2011).  Cloud differs from the conventional data centers where the company managed it all: application, data, O/S, virtualization, servers, storage, and networking.  Cloud is replacing the conventional data center because infrastructure costs are high.  For a company to be spending that much money on a conventional data center that will get outdated in 18 months (Moore’s law of technology), it’s just a constant sink in money.  Thus, outsourcing the data center infrastructure is the first step of company’s movement into the cloud.

Key Components to Success:

You need to have the buy-in of the leaders and employees when it comes to using big data analytics for predictive, prescriptive or descriptive purposes.  When it came to buy-in, Lt. Palmer had to nurture top-down support as well as buy-in from the bottom-up (ranks).  It was much harder to get buy-in from more experienced detectives, who feel that the introduction of tools like analytics, is a way to tell them to give up their long-standing practices and even replace them.  So, Lt. Palmer had sold Blue PALMS as “What’s worked best for us is proving [the value of Blue PALMS] one case at a time, and stressing that it’s a tool, that it’s a compliment to their skills and experience, not a substitute”.  Lt. Palmer got buy-in from a senior and well-respected officer, by helping him solve a case.  The senior officer had a suspect in mind, and after feeding in the data, the tool was able to predict 20 people that could have done it in an order of most likely.  The suspect was on the top five, and when apprehended, the suspect confessed.  Doing, this case by case has built the trust amongst veteran officers and thus eventually got their buy in.

Applications of Big Data Analytics:

A result of Big Data Analytics is online profiling.  Online profiling is using a person’s online identity to collect information about them, their behaviors, their interactions, their tastes, etc. to drive a targeted advertising (McNurlin et al., 2008).  Profiling has its roots in third party cookies and profiling has now evolved to include 40 different variables that are collected from the consumer (Pophal, 2014).  Online profiling allows for marketers to send personalized and “perfect” advertisements to the consumer, instantly.

Moving from online profiling to studying social media, He, Zha, and Li (2013) stated their theory, that with higher positive customer engagement, customers can become brand advocates, which increases their brand loyalty and push referrals to their friends, and approximately 1/3 people followed a friend’s referral if done through social media. This insight came through analyzing the social media data from Pizza Hut, Dominos and Papa Johns, as they aim to control more of the market share to increase their revenue.  But, is this aiding in protecting people’s privacy when we analyze their social media content when they interact with a company?

HIPAA described how we should conduct de-identification of 18 identifiers/variables that would help protect people from ethical issues that could arise from big data.   HIPAA legislation is not standardized for all big data applications/cases; it is good practice. However, HIPAA legislation is mostly concerned with the health care industry, listing those 18 identifiers that have to be de-identified: Names, Geographic data, Dates, Telephone Numbers, VIN, Fax, Device ID and serial numbers, emails addresses, URLs, SSN, IP address, Medical Record Numbers, Biometric ID (fingerprints, iris scans, voice prints, etc), full face photos, health plan beneficiary numbers, account numbers, any other unique ID number (characteristic, codes, etc), and certifications/license numbers (HHS, n.d.).  We must be aware that HIPAA compliance is more a feature of the data collector and data owner than the cloud provider.

HIPAA arose from the human genome project 25 years ago, where they were trying to sequence its first 3B base pair of the human genome over a 13 year period (Green, Watson, & Collins, 2015).  This 3B base pair is about 100 GB uncompressed and by 2011, 13 quadrillion bases were sequenced (O’Driscoll et al., 2013). Studying genomic data comes with a whole host of ethical issues.  Some of those were addressed by the HIPPA legislation while other issues are left unresolved today.

One of the ethical issues that arose were mentioned in McEwen et al. (2013), for people who have submitted their genomic data 25 years ago can that data be used today in other studies? What about if it was used to help the participants of 25 years ago to take preventative measures for adverse health conditions?  However, ethical issues extend beyond privacy and compliance.  McEwen et al. (2013) warn that data has been collected for 25 years, and what if data from 20 years ago provides data that a participant can suffer an adverse health condition that could be preventable.  What is the duty of the researchers today to that participant?

Resources:

Big Data Analytics: Crime Fighting

Case Study: Miami-Dade Police Department: New patterns offer breakthroughs for cold cases. 

Introduction:

Tourism is key to South Florida, bringing in $20B per year in a county of 2.5M people.  Robbery and the rise of other street crimes can hurt tourism and a 1/3 of the state’s sale tax revenue.  Thus, Lt. Arnold Palmer from the Robbery Investigation Police Department of Miami-Dade County teamed up with IT Services Bureau staff and IBM specialist to develop Blue PALMS (Predictive Analytics Lead Modeling Software), to help fight crime and protect the citizens and tourist to Miami-Dade County. When testing the tool it has achieved a 73% success rate when tested on 40 solved cases. The tool was developed because most crimes are usually committed by the same people who committed previous crimes.

 Key Problems:

  1. Cold cases needed to be solved and finally closed. Besides relying on old methods (mostly people skills and evidence gathering), patterns still could be missed, by even the most experienced officers.
  2. Other crimes like, robbery happen in predictable patterns (times of the day and location), which is explicit knowledge amongst the force. So, a tool shouldn’t tell them the location and the time of the next crime; the police need to know who did it, so a narrowed down list of who did it would help.
  3. The more experienced police officers are retiring, and their experience and knowledge leave with them. Thus, the tool that is developed must allow junior officers to ask the same questions of it and get the same answers as they would from asking those same questions to experienced officers.  Fortunately, the opportunity here is that newer officers come in with an embracing technology whenever they can, whereas veteran officers tread lightly when it comes to embracing technology.

Key Components to Success:

It comes to buy-in. Lt. Palmer had to nurture top-down support as well as buy-in from the bottom-up (ranks).  It was much harder to get buy-in from more experienced detectives, who feel that the introduction of tools like analytics, is a way to tell them to give up their long-standing practices and even replace them.  So, Lt. Palmer had sold Blue PALMS as “What’s worked best for us is proving [the value of Blue PALMS] one case at a time, and stressing that it’s a tool, that it’s a compliment to their skills and experience, not a substitute”.  Lt. Palmer got buy-in from a senior and well-respected officer, by helping him solve a case.  The senior officer had a suspect in mind, and after feeding in the data, the tool was able to predict 20 people that could have done it in an order of most likely.  The suspect was on the top five, and when apprehended, the suspect confessed.  Doing, this case by case has built the trust amongst veteran officers and thus eventually got their buy in.

 Similar organizations could benefit:

Other policing counties in Florida, who have similar data collection measures as Miami-Dade County Police Departments would be a quick win (a short-term plan) for tool adoption.  Eventually, other police departments in Florida and other states can start adopting the tool, after more successes have been defined and shared by fellow police officers.  Police officers have a brotherhood mentality and as acceptance of this tool grows. Eventually it will reach critical mass and adoption of the tool will come much more quickly than it does today.  Other places similar to police departments that could benefit from this tool is firefighters, other emergency responders, FBI, and CIA.

June 2020 Editorial Piece:

Please note, that the accuracy of this crime-fighting model is based on the data coming in. Currently, the data that is being fed into these systems are biased towards people of color and the Black community, even though crime rates are not dependent on race (Alexander, 2010; Kendi, 2019; Oluo, 2018). If the system that generated the input data is biased towards people of color and Black people, when used by machine learning, it will create a biased predictive model. Alexander (2010) and Kendi (2019) stated that historically some police departments tend to prioritize and surveillance communities of color more than white communities. Thus, officers would accidentally find more crime in communities of color than white communities (confirmation bias), which can then feed an unconscious bias in the police force about these communities (halo and horns effect). Another, point mentioned in both Kendi (2019) and Alexander (2010), is we may have laws in the books but they are not applied equally among all races, some laws and sentencing guidelines are harsher on people of color and the Black community. Therefore, we must rethink how we are using these types of tools and what data is being fed into the system, before using them as a black-box predictive system. Finally, I want to address the comment mentioned above “The tool was developed because most crimes are usually committed by the same people who committed previous crimes.” This issue speaks more about mass incarceration, private prisons, and school to prison pipeline issues (Alexander, 2010). Addressing these issues should be a priority, to not create racist algorithms, along with allowing returning citizens to have access to opportunities and fully restored citizen rights so that “crime” can be reduced. However, these issues alone are out of the scope of this blog post.

 Resources: