Data Tools: Artificial Intelligence and Data Analytics

Machine learning, also known as Artificial Intelligence (AI), adds an intelligence layer to big data, handling ever-bigger datasets and deriving patterns from them that even a team of data scientists would find challenging (Maycotte, 2014; Power, 2015). AI generates its insights not through how machines are programmed, but through how machines perceive the data and take actions based on that perception, essentially conducting self-learning (Maycotte, 2014).  Understanding how a machine perceives a big dataset is a hard task, which also makes the resulting final models hard to interpret (Power, 2015).  AI is even revolutionizing how we understand what intelligence is (Spaulding, 2013).

So what is intelligence?

At first, doing arithmetic was thought of as a sign of biological intelligence, until the invention of digital computers shifted the mark of biological intelligence to logical reasoning, deduction, and inference, and eventually to fuzzy logic, grounded learning, and reasoning under uncertainty, which machines now match through Bayesian network probability and current data analytics (Spaulding, 2013). So, as humans keep moving the dial on what biological intelligence is toward ever more complex abilities, any ability that relies on high-frequency and voluminous data can be matched by AI (Goldbloom, 2016).  Therefore, as our definition of intelligence expands, so will the drive to capture intelligence artificially, changing how big datasets are analyzed.

AI’s influence on the future of data analytics modeling, results, and interpretation

This concept should help revolutionize how data scientists and statisticians think about which hypotheses to test, which variables are relevant, how the resulting outputs fit into an appropriate conceptual model, and why the patterns hidden in the data help generate the decision outcomes forecasted by AI (Power, 2015). Making sense of these models requires subject matter experts from multiple fields and multiple levels of the employment hierarchy analyzing the model outputs, because it is through diversity and inclusion of thought that we will understand an AI’s analytical insights.

Also, owning data is different from understanding data (Lapowsky, 2014). Thus, AI can make use of data hidden in “dark wells” and silos, where the end user had no idea the data even existed, allowing data scientists to gain a better understanding of their datasets (Lapowsky, 2014; Power, 2015).

AI on generating datasets and using data analytics for self-improvement

Data scientists currently collect, preprocess, process, and analyze big volumes of data regularly to provide decision-makers with insights from the data so that they can make data-driven decisions (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).  From these data-driven decisions, data scientists then measure the outcomes to prove the effectiveness of their insights (Maycotte, 2014).   Analyzing the results of these data-driven decisions allows machine learning algorithms to learn from their decisions and actions and create better ways of searching for key patterns in bigger and future datasets. This is AI’s ability to conduct self-learning: applying data analytics to the results of data analytics (Maycotte, 2014). Meetoo (2016) stated that if there is enough data to create accurate rules, there is enough to create insights, because machine learning can run millions of simulations against itself to generate huge volumes of data from which to learn.
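
To make this feedback loop concrete, below is a minimal sketch in Python of a decide-measure-retrain cycle; the simulated features, outcomes, and logistic model are illustrative assumptions, not any specific system cited above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)

    # Assumed starting point: 1,000 historical cases with 5 features and a
    # measured binary outcome (e.g., whether a past decision paid off).
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    model = LogisticRegression().fit(X, y)

    for cycle in range(3):                 # three decide-measure-retrain cycles
        X_new = rng.normal(size=(200, 5))  # new cases arrive
        decisions = model.predict(X_new)   # the model drives the decisions
        # Hypothetical measurement step: real outcomes are observed later.
        outcomes = (X_new[:, 0] + 0.5 * X_new[:, 1] > 0).astype(int)
        print(f"cycle {cycle}: accuracy = {(decisions == outcomes).mean():.2f}")
        # Feed the measured outcomes back into the training data and refit.
        X, y = np.vstack([X, X_new]), np.concatenate([y, outcomes])
        model = LogisticRegression().fit(X, y)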

AI on the Data Analytics Process

AI is a result of the massive amounts of data being collected, the culmination of ideas from the most brilliant computer scientists of our time, and an IT infrastructure that did not exist a few years ago (Power, 2015).  Given that the data analytics process includes collecting data, preprocessing data, processing data, and analyzing the results, improvements made for AI at the infrastructure level can influence any part of the data analytics process (Fayyad et al., 1996; Power, 2015).  For example, as AI technology learns how to read raw data and turn it into information, the need for most current preprocessing techniques for data cleaning could disappear (Minelli, Chambers, & Dhiraj, 2013). Therefore, as AI advances, newer IT infrastructures will be dreamt up and built, and data analytics and its processes will leverage that new infrastructure, again changing how big datasets are analyzed.
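
For context, here is a minimal sketch, in Python with pandas, of the kind of manual preprocessing that such technology could eventually absorb; the column names, defect patterns, and cleaning rules are invented for illustration.

    import pandas as pd

    # Invented raw data with typical defects: a missing value,
    # inconsistently cased labels, and an impossible measurement.
    raw = pd.DataFrame({
        "age":    [34, None, 29, 451],       # 451 is an obvious entry error
        "region": ["North", "north ", "SOUTH", "South"],
    })

    clean = raw.copy()
    clean["region"] = clean["region"].str.strip().str.title()          # normalize labels
    clean = clean[clean["age"].between(0, 120) | clean["age"].isna()]  # drop impossible ages
    clean["age"] = clean["age"].fillna(clean["age"].median())          # impute missing values
    print(clean)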


Quant: Getting Lost in the Numbers

It is easy to get lost in the numbers when doing quantitative research.
Below are suggestions that can help keep the focus on the people and organizations when you are dealing with the numbers that represent them.

In quantitative research, the data collected is numerical in nature. Rarely is every member of a population studied; instead, a sample is randomly drawn from the population to represent it in the analysis (Gall, Gall, & Borg, 2006). Ultimately, the insights gained from this type of research should be impersonal, objective, and generalizable.  To generalize the insights gained from a sample, the correct mathematical procedures for handling probabilities and information, collectively known as statistical inference, must be used (Gall et al., 2006).  Gall et al. (2006) stated that statistical inference dictates the order of procedures; for instance, a hypothesis and a null hypothesis must be defined before the statistical significance level, which in turn must be defined before calculating a z or t statistic value.
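
To illustrate that ordering, here is a minimal sketch in Python; the sample values and the hypothesized population mean of 100 are invented for the example.

    from scipy import stats

    # Step 1: state the hypotheses before touching the data.
    # H0: the population mean is 100; H1: the population mean is not 100.
    # Step 2: fix the statistical significance level in advance.
    alpha = 0.05

    # Step 3: only now compute the t statistic from the sample.
    sample = [102.1, 98.4, 105.3, 101.7, 99.2, 103.8, 100.9, 104.5]
    t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
    print("Reject H0" if p_value < alpha else "Fail to reject H0")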

Essentially, statistical inference allows quantitative researchers to make inferences about a population, and researchers must remember where that population’s data was generated and collected during the quantitative research process.  However, it is easy to get lost in the numbers, so below is a list of some ways to keep the focus on the people and organizations when working with the numbers that represent them, followed by a minimal sampling sketch. To design a quantitative research project, researchers must understand the purpose and rationale of their own research designs and research methods (Creswell, 2014).  Knowing that purpose and rationale helps in developing the research question(s) and hypothesis, and with a clear research question and hypothesis a researcher can design and review their data collection from people, organizations, or instruments.  It is when focusing on the methods section that researchers can keep their focus on the people and organizations: by identifying the population, considering stratification of the population before sampling, choosing the sampling design and procedures, defining the selection process for individuals, and specifying which variables to study (their names, how they relate to the research question, and how they are collected) (Creswell, 2014).

  • The numerical data used in quantitative research was generated and collected from people, a social group, an organizational entity, or an instrument. A numerical value alone has no meaning or value to the research; but when the numerical value is paired with contextual information, it provides researchers a wealth of information on which to conduct their statistical analysis (Ahlemeyer-Stubbe & Coleman, 2014; Miller, n.d.a.).
  • Remember that each data point, row, or column represents a person, group, or thing, with all its features and bugs. It would be wise to create a metadata file that describes the variables behind the data points, to help keep the focus on the people and organizations.  In SPSS, this metadata section is called the “Variable View,” and each person is represented as an entity, or row of data, in the “Data View” (Field, 2013; Miller, n.d.b.).
  • Data sets are never neutral, theory-free repositories; they require researchers to interpret the data through their personal lenses (Crawford, Miltner, & Gray, 2014). One must gather and analyze data ethically to avoid social and legal concerns. Thus, researchers must be aware of how their analysis of the data could be used to cause harm to others or to facilitate discrimination against disenfranchised groups of people (Robinson, 2015).
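
As mentioned above, here is a minimal stratified-sampling sketch in Python with pandas; the region strata and the 10% sampling fraction are assumptions made for illustration.

    import pandas as pd

    # Assumed population frame of 1,000 people with a "region" stratum.
    population = pd.DataFrame({
        "person_id": range(1, 1001),
        "region": ["North"] * 600 + ["South"] * 300 + ["West"] * 100,
    })

    # Draw 10% from each stratum so every region is represented in
    # proportion, keeping the sample tied to the people behind the numbers.
    sample = (
        population.groupby("region", group_keys=False)
                  .sample(frac=0.10, random_state=7)
    )
    print(sample["region"].value_counts())   # North 60, South 30, West 10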

References:

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry. UK: Wiley-Blackwell. VitalBook file.
  • Crawford, K., Miltner, K., & Gray, M. L. (2014). Critiquing big data: Politics, ethics, epistemology (special section introduction). International Journal of Communication, 8, 1663–1672.
  • Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California: SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013). Discovering statistics using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Gall, M. D., Gall, J., & Borg, W. (2006). Educational research: An introduction (8th ed.). Pearson Learning Solutions. VitalBook file.
  • Miller, R. (n.d.a.). Week 1: Central tendency [Video file]. Retrieved from http://breeze.careeredonline.com/p9fynztexn6/?launcher=false&fcsContent=true&pbMode=normal
  • Miller, R. (n.d.b.). Week 2: All about SPSS. [Video file]. Retrieved from http://breeze.careeredonline.com/p99kywtldbw/?launcher=false&fcsContent=true&pbMode=normal
  • Robinson, S. C. (2015). The good, the bad, and the ugly: Applying Rawlsian ethics in data mining marketing. Journal of Mass Media Ethics, 30(1), 19–30. http://doi.org/10.1080/08900523.2014.985297

Quant: Introduction to SPSS

IBM SPSS aids the entire quantitative analytical process, helping users gain insights from their data and make better data-driven decisions (IBM, n.d.).  SPSS allows for quick statistical practice and analysis of data without the user getting too focused on, or bogged down by, the statistical equations (Field, 2013). SPSS also lets the end user graphically tell a story about their data, discovering hidden relationships for pattern analysis through tables, graphs, charts, and maps that allow pivoting (IBM, n.d.).  The tool provides high accuracy, flexibility, and advanced statistical procedures, available through the graphical user interface or through programmable options such as its internal command syntax and external programming interfaces with R, Python, Java, .NET, and others for automating procedures (IBM, n.d.).  However, Field (2013) warned that software like SPSS, which can automate statistical equations and procedures, should not be used without fully understanding the underlying statistical theory.
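
As one hedged illustration of those programmable options, the sketch below submits SPSS command syntax from Python through the IBM SPSS Statistics Python integration plug-in; the file path and the choice of statistics are assumptions, and the spss module only exists inside an SPSS installation with the plug-in enabled.

    # Runs inside IBM SPSS Statistics with the Python integration plug-in.
    import spss

    # Submit ordinary SPSS command syntax from Python to automate a
    # repeatable descriptive analysis (the file path is hypothetical).
    spss.Submit("""
    GET FILE='C:/data/bodyfat.sav'.
    DESCRIPTIVES VARIABLES=ALL
      /STATISTICS=MEAN STDDEV MIN MAX.
    FREQUENCIES VARIABLES=ALL
      /FORMAT=NOTABLE
      /HISTOGRAM.
    """)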

Variables and how to insert them into SPSS

A variable is a measurable and observable characteristic, attribute, or object that can differ across time, space, entities, people, organizations, etc. (Creswell, 2014; Field, 2013). How these variables interact with other variables helps define what type of variable they are.  There are many types of variables, such as dependent variables, independent variables, intervening/mediating variables, moderating variables, control variables, confounding variables, and extraneous variables (Creswell, 2014; Field, 2013). Dependent variables measure the outcome variation and are explained and influenced by independent variables (Schumacker, 2014); thus, the dependent variables depend on the outcomes of the independent variables (Creswell, 2014).   Independent variables are those that can be manipulated to help explain the dependent variable’s variation (Schumacker, 2014); thus, the independent variables are the probable cause of, influence on, or effect upon the dependent variable (Creswell, 2014).  Intervening/mediating variables stand between the independent and dependent variables as a probable causal link between the two (Creswell, 2014).  Moderating variables are a type of independent variable that influences the direction or strength of the relationship between the independent and dependent variables (Creswell, 2014). Control variables are a type of independent variable that is restricted in some way to help isolate possible influences on the dependent variable.  Confounding variables are not measured or observed, so their influences cannot be directly detected.  Finally, extraneous variables are a type of independent variable that is not controlled in quasi-experimental research and can influence the variation of the dependent variable (Schumacker, 2014).
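
To see several of these roles side by side, here is a minimal sketch in Python using the statsmodels formula API, where y is the dependent variable, x the independent variable, m a moderating variable entering through the interaction term, and c a control variable; all of the data are simulated for illustration.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 500
    df = pd.DataFrame({
        "x": rng.normal(size=n),   # independent variable
        "m": rng.normal(size=n),   # moderating variable
        "c": rng.normal(size=n),   # control variable
    })
    # Simulated outcome: the effect of x on y depends on m (moderation).
    df["y"] = (1.0 + 2.0 * df["x"] + 0.5 * df["m"]
               + 1.5 * df["x"] * df["m"] + 0.3 * df["c"]
               + rng.normal(size=n))

    # "x * m" expands to x + m + x:m; the x:m coefficient captures the
    # moderating effect while c is held in the model as a control.
    model = smf.ols("y ~ x * m + c", data=df).fit()
    print(model.summary())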

In SPSS, one can enter a variable into the data editor through the “Data View” window (see Figure 1) or through the “Variable View” window (see Figure 2).  In “Data View,” data can be entered in the cells below a variable name, and new variables can be added by right-clicking on the topmost cell and selecting “Insert Variable,” though this approach should be avoided (Field, 2013; Miller, n.d.).  “Variable View,” by contrast, allows the end user not only to add new variables but also to add defining descriptions and characteristics for each variable (Field, 2013; Miller, n.d.).  Every row in “Variable View” is a variable, and to add a new one, simply select the cell below the last variable shown and start typing the new variable’s name (Field, 2013).


Figure 1: SPSS “Data View” on a sample dataset called bodyfat.sav.


Figure 2: SPSS “Variable View” on a sample dataset called bodyfat.sav.

Data consists of numbers, and numbers alone do not mean a thing: the number 3 by itself is meaningless, whereas three apples, three diamonds, or 3 °C means something. Once numerical data has been collected and entered into SPSS, it must therefore be defined.  It is good practice to define the data in “Variable View” immediately after collecting it and populating SPSS, because memory fades as time goes on, and if a variable is not defined it is easy to forget what all those numbers mean.  Defining the data through “Variable View” lets the end user remember what the data in each column of SPSS represents, and tells SPSS how to treat, categorize, analyze, and display each variable. To do so, the end user enters: the name of the variable; its type (numeric, string, currency, date, Boolean, etc.); its width (the number of digits and characters in a cell); decimals (how many decimal places are displayed); label (a place to write the variable’s full name or description); values (numbers assigned to represent groups); missing (what value missing data should take); columns (the width of the display column); align (cell data display alignment); measure (nominal, ordinal, or scale); and the variable’s role (input, target, both, split, partition, or none, which is used for regression analysis) (Field, 2013; Miller, n.d.).
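
The same definitions can also be scripted rather than clicked through. Below is a hedged sketch that submits variable-definition syntax through the Python integration plug-in; the variable names and codes are assumed from the bodyfat.sav example above.

    # Runs inside IBM SPSS Statistics with the Python integration plug-in.
    import spss

    # Define labels, value codes, a missing-value code, and measurement
    # levels for variables assumed to exist in the open bodyfat.sav file.
    spss.Submit("""
    VARIABLE LABELS bodyfat 'Percent body fat'
      /sex 'Participant sex'.
    VALUE LABELS sex 0 'Female' 1 'Male'.
    MISSING VALUES bodyfat (-99).
    VARIABLE LEVEL sex (NOMINAL) /bodyfat (SCALE).
    """)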

References: