FUTURING & INNOVATION: WHAT IS INNOVATION?

Innovation can be defined as an idea, value, service, technology, method, or thing that is new to an individual, a family, a firm, a field, an industry, or a country (Jeyaraj & Sabherwal, 2014; Rogers, 1962; Rogers, 2010; Sáenz-Royo, Gracia-Lázaro, & Moreno, 2015). Under this definition, an invention can be seen as an innovation, but not all innovations are inventions (Robertson, 1967). Also, something that one entity no longer regards as an innovation can still be innovative to a different entity that adopts it (Newby, Nguyen, & Waring, 2014).

The movement of an innovation from one entity to another is called diffusion of innovation. Diffusion of Innovation theory is concerned with the why, what, how, and rate of innovation dissemination and adoption between entities, which occur through different communication channels over a period of time (Ahmed, Lakhani, Rafi, Rajkumar, & Ahmed, 2014; Bass, 1969; Robertson, 1967; Rohani & Hussin, 2015; Rogers, 1962; Rogers, 2010). However, many forces can act on an innovation and influence its likelihood of success, for example financial, technological, cultural, economic, legal, ethical, temporal, social, global, national, and local forces. Therefore, when viewing a new technology or innovation for the future, one must think critically about it and evaluate it through each of these lenses.
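The rate of adoption that diffusion theory is concerned with can be made concrete with the Bass (1969) model cited above. The sketch below is a minimal illustration in R, assuming the standard closed-form cumulative adoption curve F(t) = (1 - e^(-(p+q)t)) / (1 + (q/p) e^(-(p+q)t)), where p is the coefficient of innovation and q is the coefficient of imitation. The parameter values and the function name bass_cumulative are illustrative assumptions, not figures taken from the cited sources.

```r
# Cumulative fraction of eventual adopters at time t under the Bass model.
# p: coefficient of innovation (external influence)
# q: coefficient of imitation (word of mouth)
bass_cumulative <- function(t, p, q) {
  decay <- exp(-(p + q) * t)
  (1 - decay) / (1 + (q / p) * decay)
}

# Hypothetical parameters chosen only for illustration.
p <- 0.03
q <- 0.38
years <- 0:15
adoption <- bass_cumulative(years, p, q)

# Adoption per period is the difference between consecutive cumulative values.
new_adopters <- diff(adoption)
peak_period <- which.max(new_adopters)

print(round(adoption, 3))
cat("Period with the most new adopters:", peak_period, "\n")
```

With a small p and a larger q, the curve takes the familiar S shape: slow early adoption, a peak driven by imitation, then saturation.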

Resources:

  • Ahmed, S., Lakhani, N. A., Rafi, S. K., Rajkumar, & Ahmed, S. (2014). Diffusion of innovation model of new services offerings in universities of Karachi. International Journal of Technology and Research, 2(2), 75-80.
  • Bass, F. M. (1969). A new product growth for model consumer durables. Management Science, 15(5), 215-227.
  • Jeyaraj, A., & Sabherwal, R. (2014). The Bass model of diffusion: Recommendations for use in information systems research and practice. JITTA: Journal of Information Technology Theory and Application, 15(1), 5-30.
  • Newby, M., Nguyen, T. H., & Waring, T. S. (2014). Understanding customer relationship management technology adoption in small and medium-sized enterprises. Journal of Enterprise Information Management, 27(5), 541.
  • Robertson, T. S. (1967). The process of innovation and the diffusion of innovation. The Journal of Marketing, 14-19.
  • Rogers, E. M. (1962). Diffusion of innovations (1st ed.). New York: Simon and Schuster.
  • Rogers, E. M. (2010). Diffusion of innovations (4th ed.). New York: Simon and Schuster.
  • Rohani, M. B., & Hussin, A. R. C. (2015). An integrated theoretical framework for cloud computing adoption by universities technology transfer offices (TTOs). Journal of Theoretical and Applied Information Technology, 79(3), 415-430.
  • Sáenz-Royo, C., Gracia-Lázaro, C., & Moreno, Y. (2015). The role of the organization structure in the diffusion of innovations. PLoS One, 10(5). doi: http://dx.doi.org/10.1371/journal.pone.0126078

Big Data Analytics: Future Predictions?

Big data analytics and stifling future innovation?

One way to make a prediction about the future is to understand the current challenges in a particular field. In big data analytics, machine learning analyzes data from the past to predict or understand the future (Ahlemeyer-Stubbe & Coleman, 2014). Ahlemeyer-Stubbe and Coleman (2014) argued that learning from the past can hinder innovation. In contrast, Basole, Seuss, and Rouse (2013) studied past popular IT journal articles to see how the field of IT is evolving, and Yang, Klose, Lippy, Barcelon-Yang, and Zhang (2014) concluded that analyzing current patent information can reveal trends and give companies actionable items for building future business strategies around a patent trend. The danger of stifling innovation, per Ahlemeyer-Stubbe and Coleman (2014), arises when we rely only on past data and experiences and never try anything new. Their example is that if we only tried to optimize the horse-drawn carriage, the automobile would never have been invented (Ahlemeyer-Stubbe & Coleman, 2014). This is a very bad analogy. We should not collect data on only one item, but on its tangential items as well. We should collect a wide range of data from different fields and different sources so that new patterns can form and new connections can be made, which can promote innovation (Basole et al., 2013).

Future of Health Analytics:

Another way to analyze the future is to dream big or to take inspiration from a movie (Carter, Farmer, & Siegel, 2014). What if we could analyze our blood daily to track our overall health, beyond the daily blood sugar readings that most diabetics are accustomed to? The information generated could support a healthier lifestyle. Currently, doctors help patients with their diet and monitor their overall health while they are in care. Once patients go home, this care disappears. Constant monitoring, however, may help outpatient care and daily living. Alerts could be sent to your doctor or to family members if certain biomarkers reach a critical threshold. This could improve care by allowing people’s social networks to keep them accountable for healthy lifestyle choices, and it could shorten the time between symptom detection and emergency care in severe cases (Carter, Farmer, & Siegel, 2014).

Generating revenue from analyzing consumers:

Soon, it will not be enough to conduct item affinity analysis (market basket analysis). Item affinity analysis uses rules-based analytics to understand which items frequently co-occur in transactions (Snowplow Analytics, 2016). It is similar to the method Amazon.com currently uses to drive more sales by getting customers to consume more. However, what if we started to look at what a consumer intends to buy (Minelli, Chambers, & Dhiraj, 2013)? Analyzing data on a consumer’s product awareness, brand awareness, opinion (sentiment analysis), consideration, preferences, and purchases from that consumer’s multiple social media accounts in real time can allow marketers to create the perfect advertisement (Minelli et al., 2013). The perfect advertisement would allow companies to gain a bigger market share or to lure customers away from their competitors. Minelli et al. (2013) predicted that companies in the future will be moving towards:

  • Data that can be refreshed every second
  • Data validation that runs in real time, with alerts sent if something is wrong before data is published, to support data-driven decisions
  • Executives will receive daily data briefs from their internal processes and from their competitors to allow them to make data-driven decisions to increase revenue
  • Questions that were raised in staff meetings or other organizational meetings can be answered in minutes to hours, not weeks
  • A cultural change in companies where data is easily available and the phrase “let me show you the facts” can be easily heard amongst colleagues
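The item affinity (market basket) analysis described before this list comes down to counting how often items co-occur across transactions. The following is a minimal base R sketch, assuming a made-up set of five baskets and the usual definitions of support, confidence, and lift. The item names and the helper functions support and rule_metrics are hypothetical and serve only as an illustration.

```r
# Hypothetical transactions; each element is one customer's basket.
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("milk", "cereal"),
  c("bread", "butter", "jam")
)

# Fraction of baskets that contain every item in `items`.
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Metrics for the rule "customers who buy lhs also buy rhs".
rule_metrics <- function(lhs, rhs, baskets) {
  s_both <- support(c(lhs, rhs), baskets)
  s_lhs <- support(lhs, baskets)
  s_rhs <- support(rhs, baskets)
  c(support = s_both,
    confidence = s_both / s_lhs,
    lift = s_both / (s_lhs * s_rhs))
}

metrics <- rule_metrics("bread", "butter", baskets)
print(round(metrics, 2))
```

A lift above 1 suggests that the two items appear together more often than they would if purchases were independent.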

Big data analytics can affect many other areas as well, and a whole new world of applications is opening up. More and more companies and government agencies are hiring data scientists, not just for the value these scientists bring today but for their potential value 10-15 years from now. Thus, the field is expected to change as more talent is recruited into big data analytics.

References:

Ahlemeyer-Stubbe, A., & Coleman, S.  (2014). A Practical Guide to Data Mining for Business and Industry. Wiley-Blackwell. VitalBook file.

Basole, R. C., Seuss, D. C., & Rouse, W. B. (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics. Decision Support Systems, 54, 1044-1054.

Carter, K. B., Farmer, D., & Siegel, C. (2014). Actionable Intelligence: A Guide to Delivering Business Results with Big Data Fast!. John Wiley & Sons P&T. VitalBook file.

Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.

Snowplow Analytics (2016). Market basket analysis: identifying products and content that go well together. Retrieved from http://snowplowanalytics.com/analytics/recipes/catalog-analytics/market-basket-analysis-identifying-products-that-sell-well-together.html

Yang, Y. Y., Klose, T., Lippy, J., Barcelon-Yang, C. S., & Zhang, L. (2014). Leveraging text analytics in patent analysis to empower business decisions – a competitive differentiation of kinase assay technology platforms by I2E text mining software. World Patent Information, 39, 24-34.

Big Data Analytics: Career Prospects

Master’s and doctoral graduates have some advantages over undergraduates: they have done research or capstone projects involving big datasets; they can explain the motivations and reasoning behind the work (chapters 1 and 2 of the dissertation); they can learn and adapt quickly (chapter 3 reflects what you have learned and how you will apply it); and they can think critically about problems (chapters 4 and 5 of the dissertation). Doctoral students work on a problem for months or years to reach a solution that fills a gap in knowledge, often a gap that at first seemed unfillable. However, to prepare best for a data science or big data position, the doctoral work shouldn’t be purely theoretical and should include an analysis of huge datasets. Based on my personal analysis, when applying for a senior-level or team lead position in data science, a doctorate counts as an additional three years of experience on top of what you already have. If you lack a doctorate, you need a Master’s degree and three years of experience to be considered for the same position.

Master’s-level courses in big data help build strong mathematical, statistical, computational, and programming skills. Doctorate-level courses help you learn and push the limits of knowledge in all of these fields, and they also help you become a domain expert in a particular area of data science. That domain expertise, which you gain by going through a doctoral program, can make you more valuable in the job market (Lo, n.d.), and being more valuable can allow you to demand more in compensation. Different sources quote different salary ranges, mostly because this field has yet to be standardized (Lo, n.d.). Thus, I provide only two sources for salary ranges.

According to Columbus (2014), jobs that involve big data, such as Big Data Solution Architect, Linux Systems and Big Data Engineer, Big Data Platform Engineer, and Lead Software Engineer, Big Data (Java, Hadoop, SQL), have the following salary statistics:

  • Q1: $84,650
  • Median: $103,000
  • Q3: $121,300

Columbus (2014) also stated that it is very difficult to find the right people for an open requisition and that most requisitions remain open for 47 days. According to Columbus (2014), the most wanted skills for analytics data jobs, based on requisition postings in the field, are Python (96.90% growth in demand in the past year) and Linux and Hadoop (76% growth in demand each).

Lo (n.d.) states that individuals with just a BS or MS degree and no full-time work experience should expect $50-75K, whereas data scientists with experience can command $65-110K. Lo (n.d.) also lists the following ranges:

  • Data scientists can earn $85-170K
  • Data science/analytics managers can earn $90-140K for 1-3 direct reports
  • Data science/analytics managers can earn $130-175K for 4-9 direct reports
  • Data science/analytics managers can earn $160-240K for 10+ direct reports
  • Database administrators can earn $50-120K, increasing with experience
  • Junior big data engineers can earn $79-115K
  • Domain expert big data engineers can earn $100-165K

One way to look for currently available opportunities in the field is Gartner’s Magic Quadrant for Business Intelligence and Analytics Platforms (Parenteau et al., 2016). If, as a data scientist, you want to help push a tool towards a higher ability to execute and completeness of vision, consider employment at: Pyramid Analytics, Yellowfin, Platfora, Datawatch, Information Builders, Sisense, Board International, Salesforce, GoodData, Domo, Birst, SAS, Alteryx, SAP, MicroStrategy, Logi Analytics, IBM, ClearStory Data, Pentaho, TIBCO Software, BeyondCore, Qlik, Microsoft, and Tableau. That is one way to look at this data. Another is to see which tools lead the field (Tableau, Qlik, and Microsoft, with SAS, Birst, Alteryx, and SAP following behind) and learn those tools to become more marketable.

Big Data Analytics: POTUS Report

The aim of big data analytics is for data scientists to fuse data from various sources, of various types, and in huge amounts so that they can find relationships, identify patterns, and detect anomalies. Big data analytics can provide a descriptive, prescriptive, or predictive result for a specific research question. It isn’t perfect: sometimes the results are not significant, and we must remember that correlation is not causation. Regardless, big data analytics offers many benefits, and policy has yet to catch up with the field in order to protect the nation from potential downsides while still promoting and maximizing those benefits.

Policies for maximizing benefits while minimizing risk in public and private sector

In the private sector, companies can create detailed personal profiles that enable personalized services for consumers. Interpreting personal profile data allows a company to retain and command more market share, but it can also leave room for discrimination in pricing, service quality or type, and opportunities through “filter bubbles” (Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). Policy recommendations should encourage de-identifying personally identifiable information to the point that the data cannot be re-identified. Current policies for promoting privacy in the private sector are (Podesta et al., 2014):

  • The Fair Credit Reporting Act helps to promote fairness and privacy of credit and insurance information
  • The Health Insurance Portability and Accountability Act enables people to understand and control how their personal health data is used
  • The Gramm-Leach-Bliley Act helps consumers of financial services have privacy
  • The Children’s Online Privacy Protection Act minimizes the collection and use of data from children under the age of 13
  • The Consumer Privacy Bill of Rights is a privacy blueprint that helps people understand what personal data is being collected about them and that it is used in ways consistent with their expectations

In the public sector, issues arise when the government collects information about its citizens for one purpose and later uses that same data for a different purpose (Podesta et al., 2014). This gives the government the potential to exert power over certain types of citizens and to hamper civil rights progress in the future. Current policies in the public sector are (Podesta et al., 2014):

  • The Affordable Care Act allows for building a better health care system by moving from a “fee-for-service” program to a “fee-for-better-outcomes” program. This has allowed big data analytics to be used to promote preventative care rather than emergency care, while limiting the use of that data to deny health care coverage for “pre-existing health conditions.”
  • The Family Educational Rights and Privacy Act, the Protection of Pupil Rights Amendment, and the Children’s Online Privacy Protection Act help seal children’s educational records to prevent misuse of that data.

Identifying opportunities for big data in the economy, health, education, safety, energy-efficiency

In the economy, the internet of things can equip parts of a product with sensors that monitor and transmit thousands of live data points for sending alerts. These alerts can tell us when maintenance is needed, for which part, and where it is located, saving time and improving overall safety (Podesta et al., 2014).

In medicine, predictive analytics could be used to identify instances of insurance fraud, waste, and abuse in real time, saving more than $115M per year (Podesta et al., 2014). Another use of big data is in neonatal intensive care, where current data is used to create prescriptive results that determine which newborns are likely to contract which infection and what the outcome would be (Podesta et al., 2014). Monitoring a newborn’s heart rate and temperature along with other health indicators can alert doctors to the onset of an infection and prevent it from getting out of hand. Huge genetic data sets are helping to locate the genetic variants behind certain genetic diseases that were once hidden in our genetic code (Podesta et al., 2014).

With regard to national safety and foreign interests, data scientists and data visualizers have been using data gathered by the military to help commanders solve real operational challenges on the battlefield (Podesta et al., 2014). Applying big data analytics to satellite data, surveillance data, and road traffic flow data is making it easier to detect, obtain, and properly dispose of improvised explosive devices (IEDs). The Department of Homeland Security aims to use big data analytics to identify threats as they enter the country, as well as people with a higher than normal probability of committing acts of violence within the country (Podesta et al., 2014). Another safety-related use of big data analytics is identifying human trafficking networks by analyzing the “deep web” (Podesta et al., 2014).

Finally, for energy efficiency, understanding weather patterns and climate change can help us understand our contribution to climate change through our use of energy and natural resources. By analyzing traffic data, we can improve energy efficiency and public safety in our current lighting infrastructure by dimming lights at appropriate times (Podesta et al., 2014). Companies can maximize energy efficiency by using big data analytics to control their direct and indirect energy use (by optimizing supply chains and monitoring equipment). We are also moving to a more energy-efficient future as the government partners with electric utility companies to give businesses and families access to their own energy usage data in an easy-to-digest form, so that people and companies can change their consumption levels (Podesta et al., 2014).

Protecting your own privacy outside of policy recommendation

The report suggests that we can control our own privacy by using the private browsing function in most current internet browsers, which helps prevent the collection of personal data (Podesta et al., 2014). However, private browsing varies from browser to browser. For important decisions such as being denied employment, credit, or insurance, consumers should be empowered to know why they were denied and should ask for that information (Podesta et al., 2014). Finding out the reason allows people to address those issues and succeed in the future. We can also encrypt our communications with the strongest encryption available to protect our privacy. We need to educate ourselves on how to protect our personal data, build digital literacy, and learn how big data can be used and abused (Podesta et al., 2014). While we wait for current policies to catch up with the times, we have more power over our own data and privacy than we realize.

 

Reference:

Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J. & Zients,  J. (2014). Big Data: Seizing Opportunities, Preserving Values.  Executive Office of the President. Retrieved from https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf

Big Data Analytics: Crime Fighting

Case Study: Miami-Dade Police Department: New patterns offer breakthroughs for cold cases. 

Introduction:

Tourism is key to South Florida, bringing in $20B per year in a county of 2.5M people. Robbery and the rise of other street crimes can hurt tourism and one-third of the state’s sales tax revenue. Thus, Lt. Arnold Palmer from the Robbery Investigation Police Department of Miami-Dade County teamed up with IT Services Bureau staff and IBM specialists to develop Blue PALMS (Predictive Analytics Lead Modeling Software) to help fight crime and protect the citizens of and tourists to Miami-Dade County. When tested on 40 solved cases, the tool achieved a 73% success rate. The tool was developed because most crimes are usually committed by the same people who committed previous crimes.

Key Problems:

  1. Cold cases needed to be solved and finally closed. Even with the established methods (mostly people skills and evidence gathering), patterns could still be missed by even the most experienced officers.
  2. Other crimes, like robbery, happen in predictable patterns (times of day and locations), which is explicit knowledge amongst the force. So a tool shouldn’t tell them the location and time of the next crime; the police need to know who did it, so a narrowed-down list of likely suspects would help.
  3. The more experienced police officers are retiring, and their experience and knowledge leave with them. Thus, the tool must allow junior officers to ask it the same questions they would ask experienced officers and get the same answers. Fortunately, newer officers tend to embrace technology whenever they can, whereas veteran officers tread lightly when it comes to embracing technology.

Key Components to Success:

It comes down to buy-in. Lt. Palmer had to nurture top-down support as well as buy-in from the bottom up (the ranks). It was much harder to get buy-in from more experienced detectives, who felt that the introduction of tools like analytics was a way to tell them to give up their long-standing practices, or even to replace them. So Lt. Palmer sold Blue PALMS this way: “What’s worked best for us is proving [the value of Blue PALMS] one case at a time, and stressing that it’s a tool, that it’s a complement to their skills and experience, not a substitute.” Lt. Palmer got buy-in from a senior and well-respected officer by helping him solve a case. The senior officer had a suspect in mind, and after the data was fed in, the tool produced a list of 20 people who could have done it, ordered from most to least likely. The suspect was in the top five and, when apprehended, confessed. Doing this case by case built trust amongst veteran officers and eventually won their buy-in.

Similar organizations could benefit:

Other police departments in Florida that have data collection measures similar to those of the Miami-Dade County Police Department would be a quick win (a short-term plan) for tool adoption. Eventually, other police departments in Florida and in other states can start adopting the tool, after more successes have been documented and shared by fellow police officers. Police officers have a brotherhood mentality, and as acceptance of this tool grows, it will eventually reach critical mass, and adoption will come much more quickly than it does today. Other organizations that could benefit from this tool are fire departments, other emergency responders, the FBI, and the CIA.

June 2020 Editorial Piece:

Please note that the accuracy of this crime-fighting model depends on the data coming in. Currently, the data being fed into these systems is biased against people of color and the Black community, even though crime rates are not dependent on race (Alexander, 2010; Kendi, 2019; Oluo, 2018). If the system that generated the input data is biased against people of color and Black people, machine learning trained on that data will produce a biased predictive model. Alexander (2010) and Kendi (2019) stated that historically some police departments have tended to prioritize and surveil communities of color more than white communities. Thus, officers would end up finding more crime in communities of color than in white communities (confirmation bias), which can then feed an unconscious bias in the police force about these communities (halo and horns effect). Another point made by both Kendi (2019) and Alexander (2010) is that laws on the books are not applied equally among all races; some laws and sentencing guidelines are harsher on people of color and the Black community. Therefore, we must rethink how we use these types of tools and what data is fed into them before using them as black-box predictive systems. Finally, I want to address the comment above: “The tool was developed because most crimes are usually committed by the same people who committed previous crimes.” This issue speaks more to mass incarceration, private prisons, and the school-to-prison pipeline (Alexander, 2010). Addressing these issues should be a priority, so that we do not create racist algorithms, along with giving returning citizens access to opportunities and fully restored citizenship rights so that “crime” can be reduced. However, these issues are beyond the scope of this blog post.

Big Data Analytics: Open-Sourced Tools

Here are three open source text mining software tools for analyzing unstructured big data:

  1. Carrot2
  2. Weka
  3. Apache OpenNLP.

One of the great things about these three software tools is that they are free, so none of them carries a licensing cost.

Carrot2

Carrot2 is a Java-based tool that also offers native integration with PHP and a C#/.NET API (Gonzalez-Aguilar & Ramirez Posada, 2012). It can organize a collection of documents into categories based on themes in a visual manner, and it can also be used as a web clustering engine. Carpineto, Osinski, Romano, and Weiss (2009) stated that web clustering engines like Carrot2 help with fast subtopic retrieval (e.g., a search for “tiger” can return Tiger Woods, tigers, Bengals, the Bengals football team, etc.), topic exploration (through a cluster hierarchy), and alleviating information overlook (by going beyond the first page of search results). The algorithms it uses for categorization are Lingo (Lingo3G), k-means, and STC, which can support multiple-language clustering, synonyms, etc. (Carrot, n.d.). This software can also be used online instead of regular search engines (Gonzalez-Aguilar & Ramirez Posada, 2012). Gonzalez-Aguilar and Ramirez Posada (2012) explain that the interface has three phases for processing information: entry, filtration, and exit. It represents the cluster data in three visual formats: heatmap, network, and pie chart.
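To illustrate the kind of subtopic grouping described above, the sketch below clusters four made-up result snippets for the query “tiger” with k-means, one of the algorithms named. It is a base R illustration under simplifying assumptions (raw term counts, two clusters, hypothetical snippets) and is not Carrot2's actual implementation.

```r
# Hypothetical search-result snippets for the query "tiger".
snippets <- c(
  "tiger woods golf masters",
  "tiger woods wins golf tournament",
  "bengal tiger habitat jungle",
  "tiger jungle habitat conservation"
)

# Split each snippet into lowercase word tokens.
tokens <- strsplit(tolower(snippets), "[^a-z]+")
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per snippet, one column per term.
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(tabulate(match(tk, vocab), nbins = length(vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

# k-means with several random starts; the seed makes the run repeatable.
set.seed(42)
fit <- kmeans(dtm, centers = 2, nstart = 25)

print(data.frame(snippet = snippets, cluster = fit$cluster))
```

The golf snippets and the wildlife snippets should land in separate clusters, which is the behaviour a clustering search engine relies on to present subtopics.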

The disadvantage of this tool is that it only does clustering analysis, but its advantage is that it can be applied to a search engine to facilitate faster and more accurate searches through its subtopic analysis.  If you would like to use Carrot2 as a search engine, go to http://search.carrot2.org/stable/search and try it out.

Weka

Weka was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga, 2015). It is a Java-based collection of machine learning algorithms for data mining and text mining tasks that can be used for predictive modeling, with support for pre-processing, classification, regression, clustering, association rules, and visualization (Weka, n.d.). Weka can be applied to big data (Weka, n.d.) and SQL databases (Patel & Donga, 2015).

A disadvantage of this tool is its lack of support for multi-relational data mining, although if you can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). Its advantage is the comprehensiveness of its analysis algorithms for data mining, text mining, and pre-processing.

Apache OpenNLP

Apache OpenNLP is a Java-based machine learning toolkit that supports tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution (OpenNLP, n.d.). OpenNLP works well with the NetBeans and Eclipse IDEs, which helps in the development process. This tool has dependencies on Maven, UIMA Annotators, and SNAPSHOT.

The advantage of OpenNLP is that rules, constraints, and lexicons don’t need to be specified manually. Instead, it is a machine learning method that aims to maximize entropy (Buyko, Wermter, Poprat, & Hahn, 2006). Maximizing entropy allows facts to be collected consistently and uniformly. When the sentence splitter, tokenization, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution were tested on two medical corpora, accuracy was in the high 90% range (Buyko et al., 2006).
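The maximum-entropy idea can be illustrated with a small calculation. Among probability distributions over the same set of outcomes, the uniform one has the highest Shannon entropy, so a maximum-entropy model stays as noncommittal as possible beyond what the training data constrain. The sketch below is a generic base R illustration; the function name shannon_entropy and the example distributions are assumptions, and nothing here is taken from OpenNLP's code.

```r
# Shannon entropy in bits; zero-probability outcomes contribute nothing.
shannon_entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Two hypothetical distributions over four part-of-speech tags.
uniform <- c(0.25, 0.25, 0.25, 0.25)
skewed <- c(0.70, 0.10, 0.10, 0.10)

cat("Uniform:", shannon_entropy(uniform), "bits\n")
cat("Skewed: ", round(shannon_entropy(skewed), 3), "bits\n")
```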

This software has high accuracy as its advantage, but it also produces quite a few false negatives, which is its disadvantage. In the sentence splitter, it picked up literature citations, and in tokenization, it picked up the special characters “-” and “/” (Buyko et al., 2006).

References:

  • Buyko, E., Wermter, J., Poprat, M., & Hahn, U. (2006). Automatically adapting an NLP core engine to the biology domain. In Proceedings of the Joint BioLINK-Bio-Ontologies Meeting. A Joint Meeting of the ISMB Special Interest Group on Bio-Ontologies and the BioLINK Special Interest Group on Text Data Mining in Association with ISMB (pp. 65-68).
  • Carpineto, C., Osinski, S., Romano, G., & Weiss, D. (2009). A survey of web clustering engines. ACM Comput. Surv. 41, 3, Article 17 (July 2009), 38 pages. DOI = 10.1145/1541880.1541884 http://doi.acm.org/10.1145/1541880.1541884
  • Carrot (n.d.). Open source framework for building search clustering engines. Retrieved from http://project.carrot2.org/index.html
  • Gonzalez-Aguilar, A., & Ramirez-Posada, M. (2012). Carrot2: Búsqueda y visualización de la información (in Spanish). El Profesional de la Informacion. Retrieved from http://project.carrot2.org/publications/gonzales-ramirez-2012.pdf
  • OpenNLP (n.d.). The Apache Software Foundation: OpenNLP. Retrieved from https://opennlp.apache.org/
  • Weka (n.d.). Weka 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).

Big Data Analytics: R

R is a powerful statistical tool that can aid in data mining, so it has huge relevance in the big data arena. Focusing on my project, I have found that R has a text mining package, tm.

Patel and Donga (2015) and Fayyad, Piatetsky-Shapiro, and Smyth (1996) say that the main techniques in data mining are: anomaly detection (outlier/change/deviation detection), association rule learning (relationships between variables), clustering (grouping data points that are similar to one another), classification (applying a known structure to new data), regression (finding a function to describe the data), and summarization (visualizations, reports, dashboards). According to Ghosh, Roy, and Bandyopadhyay (2012), the main text mining techniques are: text categorization (assigning text/documents to pre-defined categories), text clustering (grouping similar text/documents together), concept mining (discovering concept/logic-based ideas), information retrieval (finding the documents relevant to a query), and information extraction (identifying key phrases and relationships within the text). Agrawal and Batra (2013) add summarization (a compressed representation of the input), document similarity assessment (similarities between different documents), and document retrieval (identifying and grabbing the most relevant documents) to the list of text mining techniques.

We use “library(tm)” to transform text, stem words, build a term-document matrix, etc., mostly for preprocessing the data (RStudio pubs, n.d.). Based on RStudio pubs (n.d.), some text preprocessing steps and their code are as follows, where docs is a corpus that has already been loaded with tm:

  • To remove punctuation:

docs <- tm_map(docs, removePunctuation)

  • To remove special characters:

for (j in seq(docs)) {
  docs[[j]] <- gsub("/", " ", docs[[j]])
  docs[[j]] <- gsub("@", " ", docs[[j]])
  docs[[j]] <- gsub("\\|", " ", docs[[j]])
}

  • To remove numbers:

docs <- tm_map(docs, removeNumbers)

  • Convert to lowercase:

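# In tm 0.6 and later, wrap base R functions in content_transformer() so the corpus structure is kept
docs <- tm_map(docs, content_transformer(tolower))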

  • Removing “stopwords”/common words

docs <- tm_map(docs, removeWords, stopwords("english"))

  • Removing particular words

docs <- tm_map(docs, removeWords, c("department", "email"))

  • Combining words that should stay together

for (j in seq(docs)) {
  docs[[j]] <- gsub("qualitative research", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative studies", "QDA", docs[[j]])
  docs[[j]] <- gsub("qualitative analysis", "QDA", docs[[j]])
  docs[[j]] <- gsub("research methods", "research_methods", docs[[j]])
}

  • Removing common word endings (stemming)

library(SnowballC)
docs <- tm_map(docs, stemDocument)
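
The steps above can be chained together and followed by the term-document matrix mentioned earlier. The following is only a minimal sketch: the three sentences in raw_text are made up for illustration, and it assumes that the tm and SnowballC packages are already installed. It is not the exact code from RStudio pubs (n.d.).

```r
# Minimal sketch: run the preprocessing steps on a tiny, made-up corpus.
# Assumes the tm and SnowballC packages are installed.
library(tm)
library(SnowballC)

# Hypothetical sample text (not from the original project)
raw_text <- c(
  "The department sent 3 emails about qualitative research.",
  "Researchers email the department: research methods matter!",
  "Qualitative studies and research methods were discussed in 2015."
)

# Build a corpus with one document per sentence
docs <- VCorpus(VectorSource(raw_text))

# Apply the preprocessing steps in order
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("department", "email"))
docs <- tm_map(docs, stemDocument)
docs <- tm_map(docs, stripWhitespace)

# Term-document matrix: rows are terms, columns are documents
tdm <- TermDocumentMatrix(docs)

# Total frequency of each term across the corpus, most frequent first
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(term_freq)
```

Lowercasing comes first here so that the stopword and custom word removal, which match lowercase words, catch capitalized words as well.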

Text mining algorithms could consist of but are not limited to (Zhao, 2013):

  • Summarization:
    • Word clouds: use “library(wordcloud)”
    • Word frequencies
  • Regressions
    • Term correlations: use the function findAssocs() from “library(tm)”
    • Plot word frequencies: use “library(ggplot2)”
  • Classification models:
    • Decision tree: use “library(party)” or “library(rpart)”
  • Association models:
    • Apriori: use “library(arules)”
  • Clustering models:
    • K-means clustering: use “library(fpc)”
    • K-medoids clustering: use “library(fpc)”
    • Hierarchical clustering: use “library(cluster)”
    • Density-based clustering: use “library(fpc)”
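
To make two of the items in the list above concrete, here is a small sketch of word frequencies and clustering. It is an illustration under assumptions: the document-term counts are invented, and it uses kmeans() and hclust() from base R's stats package rather than the fpc and cluster packages named in the list, so that it runs without extra installs.

```r
# Hypothetical document-term counts: rows are documents, columns are terms
dtm <- matrix(
  c(5, 4, 0, 0,
    4, 5, 1, 0,
    0, 1, 5, 4,
    0, 0, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("hadoop", "mapreduce", "survey", "interview"))
)

# Summarization: total frequency of each term across all documents
term_freq <- sort(colSums(dtm), decreasing = TRUE)

# K-means clustering into 2 groups; the seed makes the run repeatable
set.seed(42)
km <- kmeans(dtm, centers = 2, nstart = 10)

# Hierarchical clustering on Euclidean distances, cut into 2 groups
hc <- hclust(dist(dtm), method = "average")
hc_groups <- cutree(hc, k = 2)

print(term_freq)
print(km$cluster)
print(hc_groups)
```

With real text, dtm would come from a document-term matrix built with tm instead of being typed in by hand.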

As we can see, there are current libraries, functions, etc. to help with data preprocessing, data mining, and data visualization when it comes to text mining with R and RStudio.

Resources:

Big Data Analytics: Installing R

I didn’t have any problems with the installation, thanks to a video produced by Dr. Webb (2014).  The download is larger than I expected, so it can take a few minutes depending on your download speed and internet connection. The steps are:

(1)    For proper installation of R, you need to have administrative access on your computer.

(2)    Watch the video (Webb, 2014) for step-by-step instructions and a tutorial on installing R and its graphical Integrated Development Environment (IDE), RStudio.

  1. Note: The 32-bit and 64-bit versions of R can be found at http://cran.r-project.org/
  2. Note: The free RStudio “Desktop” graphical IDE can be found at http://www.rstudio.com/

(3)    Once R is installed, use the manual for the application at this site: http://cran.r-project.org/doc/manuals/R-intro.html

Once I had installed the software and the graphical IDE, I continued to follow along with the video, using the prepopulated cars data from the “datasets” package, and I got the same result as shown in the video.  I would also like to note that Dr. Webb (2014) had checked the packages “datasets,” “graphics,” “grDevices,” “methods,” and “stats” in the video, which can be hard to see depending on your video streaming resolution.
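
A quick way to confirm that the installation works is to load the same built-in data from the R console. This is only a sketch of a sanity check and does not reproduce the exact steps in the video; cars ships with R in the “datasets” package, so nothing extra needs to be installed.

```r
# Load the built-in "datasets" package (normally attached by default)
library(datasets)

# cars records car speeds and stopping distances
data(cars)

# Look at the first rows and a numeric summary of each column
print(head(cars))
print(summary(cars))

# Number of observations and the column names
n_rows <- nrow(cars)
col_names <- names(cars)
print(n_rows)
print(col_names)
```

If these commands print a table with the columns speed and dist, then R and its default packages are installed correctly.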

Resources:

Webb, J. (2014). Installing and Using the “R” Programming Language and RStudio. Retrieved from https://www.youtube.com/watch?v=77PgrZSHvws&feature=youtu.be

Big Data Analytics: Hadoop®

Hadoop® Distributed File System (HDFS):

In HDFS, big data is broken up into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers.  This block system provides an easy way to scale the data needs of the company up or down, and it allows MapReduce to do its tasks on smaller sets of the data for faster processing (IBM, n.d.). Blocks are small enough that they can be easily duplicated (for disaster recovery purposes) on two different servers (or more, depending on your data needs).

Example 1:

To picture data stored in HDFS, think of a deck of cards, in which each card holds information about what it is: value, color, symbol, etc.  HDFS can divide the data into blocks by value (A, 2, 3 … J, Q, & K), so each block holds the data for four cards.  Thus, there are 13 distinct data blocks, which have been parsed by their value and placed on 13 different servers.  Let’s also assume I need higher than average availability for some values, so rather than two copies, I need four copies of the J, Q, & K blocks, and two copies of the A, 2 … 10 blocks.  This is possible.  The copies could be clustered on similar servers, or each copy could have a server of its own.  This type of redundancy within HDFS gives my data higher availability.  Thus, when I need to analyze the data on my deck of cards, I can say that the important values J, Q, & K have a higher chance of being available than my perceived lower-value cards A, 2 … 10.
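
The availability claim in this example can be put into numbers with a small sketch. It rests on two assumptions that are not from IBM (n.d.): every copy of a block sits on a different server, and each server is down independently with the same made-up probability p_down.

```r
# The 13 card values act as the 13 data blocks in the example
ranks <- c("A", 2:10, "J", "Q", "K")

# Copies per block: 4 for J, Q, K and 2 for everything else
replicas <- ifelse(ranks %in% c("J", "Q", "K"), 4, 2)
names(replicas) <- ranks

# Assumption: each server is down independently with this probability
p_down <- 0.05

# A block is unavailable only if every server holding a copy is down
availability <- 1 - p_down^replicas

print(round(availability, 8))
```

Under these assumptions a two-copy block is reachable 99.75% of the time, while a four-copy block is reachable more than 99.999% of the time, which is the gap the example describes.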

MapReduce:

MapReduce contains two types of tasks that work in parallel on distributed systems: (1) Mappers, which process the input on the system by mapping/aggregating the data by key values, and (2) Reducers, which, given a key value, take all the values stored in the map for that key and reduce the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Different Reducers can work on different keys.  Huge amounts of data are entered into MapReduce; the Mapper maps the data, and then the data is shuffled and sorted before it is reduced.  Once the data is reduced, we get the output that we sought.

According to IBM (n.d.), MapReduce functions that use HDFS run their procedures on the server on which the data is stored (also known as data locality).  Keeping in mind that HDFS has at least two backup copies, if one server goes down, which can happen, MapReduce can continue doing the tasks on the same data on a different server that is working.  This backup system for disaster recovery allows for high data availability.

Example 2:

This example, adjusted from Sathupadi (2010), looks at how MapReduce can calculate the sum of the current outstanding school loans of all Harvard law students and medical students per degree type.  Thus, the final output from our example would be four totals: the current outstanding school loan amount for Juris Doctor (JD) students, for Master of Laws (LLM) students, for Doctor of Medicine (MD) students, and for Doctor of Osteopathic Medicine (DO) students.

If I ran this in Hadoop, a single copy of the data could be spread across 50 servers, and thus 50 nodes could be used to process this request in parallel, speeding up the processing time significantly, but not by 50-fold.  It is not 50-fold because it takes time to go from mapping to reducing, and the nodes need to talk to each other, which slows down the transaction.  So, running on X nodes in parallel never really means we are X times faster; in reality, we are roughly X - e times faster (where e is the transaction cost).
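
One simple way to put numbers on this is to assume that a fixed coordination overhead is added to the ideal parallel time. This is a different bookkeeping than the X - e shorthand above, and both the formula and the numbers below are assumptions chosen only for illustration, not measurements from Hadoop.

```r
# Assumed model: parallel time = work / nodes + fixed overhead
# work and overhead are in the same (hypothetical) time units
speedup <- function(nodes, work = 100, overhead = 3) {
  serial_time <- work
  parallel_time <- work / nodes + overhead
  serial_time / parallel_time
}

# 50 nodes: 100 / (100 / 50 + 3) = 100 / 5 = 20 times faster, not 50
print(speedup(50))

# Adding nodes helps less and less once the overhead dominates
print(sapply(c(1, 10, 50, 500), speedup))
```

In this toy model the speedup can never exceed work / overhead, no matter how many nodes are added.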

The data that gets thrown out in the mapper phase would be the records for undergraduate students, Doctor of Philosophy students, master’s degree students, etc.  Only JD, LLM, MD, and DO students get a key assigned to them, one key per degree type, and the keys are the same across all nodes, so that the sum of all current outstanding school loan amounts gets processed under the correct group.  If the data is duplicated at least twice on different servers and a server goes down, the MapReduce function will move on to a copy of that data, which can still be mapped and reduced.
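
The map, shuffle, and reduce phases of this example can be imitated on a single machine. The sketch below is plain R, not Hadoop code: the student records and loan amounts are invented, and the two “nodes” are simply two slices of one data frame.

```r
# Hypothetical student records (made-up values for illustration)
students <- data.frame(
  degree = c("JD", "MD", "PhD", "LLM", "JD", "DO", "BA", "MD"),
  loan   = c(120000, 200000, 30000, 80000, 100000, 180000, 25000, 150000),
  stringsAsFactors = FALSE
)

# Degree types that receive a key; everything else is discarded
wanted <- c("JD", "LLM", "MD", "DO")

# Pretend the records live on two nodes (two splits of the data)
splits <- split(students, rep(1:2, length.out = nrow(students)))

# Map: each node drops unwanted degrees and emits (key, value) pairs
mapper <- function(chunk) {
  keep <- chunk[chunk$degree %in% wanted, ]
  data.frame(key = keep$degree, value = keep$loan, stringsAsFactors = FALSE)
}
mapped <- lapply(splits, mapper)

# Shuffle and sort: gather all values that share the same key
pairs <- do.call(rbind, mapped)
shuffled <- split(pairs$value, pairs$key)

# Reduce: sum the loan amounts for each key
reduced <- vapply(shuffled, sum, numeric(1))
print(reduced)
```

Because every node uses the same four keys, the loan amounts mapped on different nodes end up in the same group before they are summed.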

 Resources: