Adv DBs: A possible future project?

Below is a possible future research paper on a database-related subject.

Title: Using MapReduce to aid in analyzing clinical test utilization patterns in medicine

The motivation:

Efficient processing and analysis of clinical data could aid in better clinical testing for patients, and MapReduce allows for an integrated solution in the medical field, which helps save resources when moving data in and out of storage.

The problem statement (symptom and root cause)

Rates of Sexually Transmitted Infections (STIs) are increasing at an alarming pace. Could loading the test utilization patterns of the Roper Saint Francis Clinical Network in the South into Hadoop with MapReduce reveal patterns in the current STI population and predict areas where an outbreak may be imminent?

The hypothesis statement (propose a solution and address the root cause)

H0: Data mining in Hadoop with MapReduce will not be able to identify any meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

H1: Data mining in Hadoop with MapReduce can identify a meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

The research questions

Could the findings of this study on STI outbreak rates be generalized to other disease outbreak rates?

Is this application of data mining in Hadoop with MapReduce the correct way to analyze the data?

The professional significance statement (new contribution to the body of knowledge)

According to Mohammed et al. (2014), identifying where an outbreak of any disease (including STIs) will occur via clinical test utilization patterns has yet to be done, and they state that Hadoop with MapReduce is well suited to clinical work because it has already been adopted in related fields of medicine such as bioinformatics.

Resources

  • Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: Current landscape and future trends. BioData Mining, 7. doi:10.1186/1756-0381-7-22 – Doctoral Library Advanced Technologies & Aerospace Collection
  • Pokorny, J. (2011). NoSQL databases: A step to database scalability in web environment. In iiWAS '11: Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (pp. 278-283). – Doctoral Library ACM Digital Library

Quant: Compelling topics

Most Compelling Topics

Field (2013) states that quantitative and qualitative methods are complementary, not competing, approaches to solving the world's problems, although the two methods are quite different from each other. Simply put, quantitative methods are used when the research contains variables that are numerical, and qualitative methods are used when the research contains variables that are based on language (Field, 2013).  Thus, central to quantitative research and methods is understanding the numerical, ordinal, or categorical dataset and what the data represent. This can be done through descriptive statistics, where the researcher uses statistics to describe a data set, or through inferential statistics, where conclusions are drawn about the data set (Miller, n.d.).

Field (2013) and Schumacker (2014) define central tendency as an umbrella term for describing the "center of a frequency distribution" through the commonly used measures of mean, median, and mode.  Outliers, missing values, and adding or multiplying by a constant are all factors that affect central tendency (Schumacker, 2014).  Beyond looking at a single measure of central tendency, researchers can also compare the mean and median to understand how skewed the data are and in which direction: heavy skew increases the distance between these two values, and if the mean is less than the median, the distribution is negatively skewed (Field, 2013).  To understand the distribution better, other measures such as the variance and standard deviation can be used.

Variance and standard deviation are measures of dispersion, with variance being a measure of average dispersion (Field, 2013; Schumacker, 2014).  Variance is a numerical value that describes how the observed data values are spread across the distribution and how far they differ from the mean on average (Huck, 2011; Field, 2013; Schumacker, 2014).  A smaller variance indicates that the observed data values lie close to the mean, and vice versa (Field, 2013).
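As a small illustration of these measures (the values below are hypothetical, and Python with numpy is assumed to be available), the mean-median comparison can be used to read off the skew direction directly:

import numpy as np

scores = np.array([2, 3, 3, 4, 5, 5, 6, 7, 9, 21])   # hypothetical data with one high outlier

mean, median = scores.mean(), np.median(scores)
variance = scores.var(ddof=1)   # sample variance: average squared distance from the mean
sd = scores.std(ddof=1)         # sample standard deviation

print(f"mean = {mean:.1f}, median = {median:.1f}, variance = {variance:.1f}, sd = {sd:.1f}")
# mean > median here, so the outlier pulls the distribution toward a positive skew;
# a mean below the median would indicate a negative skew instead
print("skew direction:", "positive" if mean > median else "negative" if mean < median else "symmetric")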

Rarely is every member of a population studied; instead, a sample is drawn at random to represent that population for analysis in quantitative research (Gall, Gall, & Borg, 2006). Ultimately, the insights gained from this type of research should be impersonal, objective, and generalizable.  To generalize the insights gained from a sample of data, the correct mathematical procedures for using probabilities and information must be applied; this is statistical inference (Gall et al., 2006).  Gall et al. (2006) state that statistical inference dictates the order of procedures; for instance, a hypothesis and a null hypothesis must be defined before the statistical significance level, which in turn must be set before calculating a z or t statistic.  Essentially, statistical inference allows quantitative researchers to make inferences about a population, and researchers must keep in mind where and how the data were generated and collected during the quantitative research process.
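A minimal sketch of that order of procedures, using a hypothetical one-sample t-test in Python (scipy assumed to be available), might look like this:

from scipy import stats

# (1) State the hypotheses: H0 says the population mean is 100; H1 says it is not (hypothetical).
# (2) Set the significance level before looking at any statistic.
alpha = 0.05
# (3) Only then compute the test statistic from the sample (hypothetical values).
sample = [102, 98, 105, 110, 97, 103, 108, 101]
t, p = stats.ttest_1samp(sample, popmean=100)
print(f"t = {t:.3f}, p = {p:.4f}, reject H0 at alpha={alpha}: {p < alpha}")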

Most flaws in research methodology exist because validity and reliability were not established (Gall et al., 2006). Thus, it is important to use a valid and reliable assessment instrument.  When using any existing survey as an assessment instrument, one should report the instrument's development, items, scales, and reports on reliability and validity from past uses (Creswell, 2014; Joyner, 2012).  Permission must be secured for using any instrument, and that permission should be placed in the appendix (Joyner, 2012).  The validity of the assessment instrument is key to drawing meaningful and useful statistical inferences (Creswell, 2014).

Through sampling a population and using a valid and reliable survey instrument for assessment, attitudes and opinions about that population can be correctly inferred from the sample (Creswell, 2014).  Sometimes a survey instrument does not fit those in the target group, in which case it will not produce valid or reliable inferences for the targeted population. One must therefore select a targeted population and determine the size of that stratified population (Creswell, 2014).

Parametric statistics are inferential, based on random sampling from a distinct population, and use the sample data to make strict inferences about the population's parameters; thus tests like t-tests, chi-square, and F-tests (ANOVA) can be used (Huck, 2011; Schumacker, 2014).  Nonparametric statistics, or "assumption-free tests," are used with ranked data, as in the Mann-Whitney U-test, Wilcoxon signed-rank test, Kruskal-Wallis H-test, and chi-square (Field, 2013; Huck, 2011).

First, there is a need to define the types of data.  Continuous data is interval/ratio data, and categorical data is nominal/ordinal data.  Modified from Schumacker (2014) with data added from Huck (2011):

Statistic                        Dependent Variable    Independent Variable
Analysis of Variance (ANOVA)
     One-way                     Continuous            Categorical
t-Tests
     Single sample               Continuous
     Independent groups          Continuous            Categorical
     Dependent (paired groups)   Continuous            Categorical
Chi-square                       Categorical           Categorical
Mann-Whitney U-test              Ordinal               Ordinal
Wilcoxon                         Ordinal               Ordinal
Kruskal-Wallis H-test            Ordinal               Ordinal

Meaningful results should be reported along with their statistical significance, confidence intervals, and effect sizes (Creswell, 2014). If the results from a statistical test have a low probability of occurring by chance (5% or 1% or less), the test is considered significant (Creswell, 2014; Field, 2014; Huck, 2011). Statistical significance tests can reflect the same effect yet produce different values (Field, 2014).  In large samples, small differences can show up as statistically significant, while in smaller samples large differences may be deemed insignificant (Field, 2014).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011).  Huck (2011) states that two main factors that can influence whether a result is statistically significant are the quality of the research question and the research design.

Huck (2011) suggests that after statistical significance is calculated and the researcher either rejects or fails to reject the null hypothesis, an effect size analysis should be conducted.  The effect size allows researchers to measure objectively the magnitude, or practical significance, of the research findings by looking at the differential impact of the variables (Huck, 2011; Field, 2014).  Field (2014) defines one way of measuring effect size through Cohen's d: d = (mean of group 1 - mean of group 2) / standard deviation.  If d = 0.2 there is a small effect, d = 0.5 a moderate effect, and d = 0.8 or more a large effect (Field, 2014; Huck, 2011). This could be why a statistical test yields a statistically significant value while further analysis with effect size shows that those statistically significant results do not explain much of what is happening in the total relationship.
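To make the distinction between significance and effect size concrete, here is a small illustrative Python sketch (the data are simulated, not from any study discussed here, and numpy/scipy are assumed to be available); with a very large sample, a tiny mean difference comes out statistically significant even though Cohen's d stays small:

import numpy as np
from scipy import stats

# Simulated illustration: a tiny true difference (about 0.05 SD) with a very large n
rng = np.random.default_rng(42)
a = rng.normal(loc=100.00, scale=15, size=50_000)
b = rng.normal(loc=100.75, scale=15, size=50_000)

t, p = stats.ttest_ind(a, b)
d = abs(a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)

print(f"p = {p:.3g}")   # far below 0.05, so the test is "statistically significant"
print(f"d = {d:.3f}")   # around 0.05, well below the 0.2 "small effect" benchmark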

In regression analysis, it should be possible to predict the dependent variable from the independent variables, provided two conditions are met: (1) the assessment tool is valid and reliable (Creswell, 2014), and (2) the sample size is large enough to conduct the analysis and draw statistical inferences about the population from the collected sample data (Huck, 2011). Assuming these two conditions are met, regression analysis can be run on the data to create a prediction formula. Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011).

When modeling the dependent variable as a function of the independent variable, the regression model with the strongest correlation is preferred, because that regression formula best explains the variance between the variables.   However, even when a regression formula predicts some or most of the variance between the variables, it never implies causation (Field, 2013).  Correlations describe the strength of the relationship captured by the regression formula and can vary in value from -1 to +1.  The closer the correlation coefficient is to -1 or +1, the better the regression formula predicts the variance between the variables; the closer the coefficient is to zero, the weaker the relationship between the variables (Field, 2013; Huck, 2011; Schumacker, 2014).  It should never be forgotten that correlation does not imply causation, but squaring the correlation value (r2) gives the percentage of the variance between the variables explained by the regression formula (Field, 2013).
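As a brief illustration (the x and y values are hypothetical, and numpy/scipy are assumed to be available), the correlation coefficient, its square, and the resulting prediction formula can be computed directly:

import numpy as np
from scipy import stats

# Hypothetical paired observations
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.3, 6.9, 8.1, 8.8])

r, p = stats.pearsonr(x, y)                           # correlation coefficient and its p-value
slope, intercept, _, _, _ = stats.linregress(x, y)    # simple linear regression line

print(f"r = {r:.3f}, r^2 = {r**2:.3f}")               # r^2 = share of variance explained by the regression
print(f"prediction formula: y = {intercept:.2f} + {slope:.2f} * x")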

 

References:

  • Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California: SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Gall, M. D., Gall, J., & Borg, W. (2006). Educational research: An introduction (8th ed.). Pearson Learning Solutions. VitalBook file.
  • Huck, S. W. (2011). Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Joyner, R. L. (2012). Writing the Winning Thesis or Dissertation: A Step-by-Step Guide (3rd ed.). Corwin. VitalBook file.
  • Miller, R. (n.d.). Week 1: Central tendency [Video file]. Retrieved from http://breeze.careeredonline.com/p9fynztexn6/?launcher=false&fcsContent=true&pbMode=normal
  • Schumacker, R. E. (2014). Learning statistics using R. California: SAGE Publications, Inc. VitalBook file.

Quant: Chi-Square Test in SPSS


Introduction

The aim of this analysis is to determine the strength of the association between the variables agecat and degree, as well as the major contributing cells, through a chi-square analysis. Standardized residuals are used to identify which cells contribute most to the association.

Hypothesis

  • Null: There is no association between agecat and degree
  • Alternative: There is a real association between agecat and degree

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to look at the relationships between the following variables: agecat (Age category) and degree (Respondent’s highest degree).

To conduct a chi-square analysis, navigate through Analyze > Descriptive Statistics > Crosstabs.

The variable degree was placed in the "Row(s)" box, and agecat was placed in the "Column(s)" box.  Under the "Statistics" button, "Chi-square" was selected, along with "Lambda" in the "Nominal" section. Under the "Cells" button, "Standardized" was selected in the "Residuals" section. The procedures for this analysis are provided in video tutorial form by Miller (n.d.).  The following output was observed in the next four tables.

Results

Table 1: Case processing summary.

                         Cases
                         Valid              Missing            Total
                         N       Percent    N       Percent    N       Percent
Degree * Age category    1411    99.4%      8       0.6%       1419    100.0%

From the total sample size of 1419 participants, 8 cases are reported missing, yielding a 99.4% valid response rate (Table 1).   Examining the crosstabulation, the standardized residuals in the "Less than high school" row are well below -1.96 for the 30-39, 40-49, and 50-59 age groups and well above +1.96 for the 60-89 age group, so the observed and expected frequencies differ significantly in those cells.  Likewise, for the 60-89 age group in the "Junior college or more" row, the standardized residual is below -1.96, so those frequencies also differ significantly.  For all of these cells, SPSS identifies observed frequencies that are far from the expected frequencies (Miller, n.d.).  Significant negative standardized residuals indicate that the model is over-predicting the number of people in that age group with that diploma level (or lack thereof), while significant positive standardized residuals indicate that the model is under-predicting them.

Table 2: Degree by Age category crosstabulation.

                                                         Age category
                                            18-29   30-39   40-49   50-59   60-89   Total
Degree  Less than high school    Count         42      33      36      20     112     243
                                 Std. Residual -.1    -2.8    -2.3    -2.7     7.1
        High school              Count        138     162     154     113     158     725
                                 Std. Residual  .9      .2     -.2      .4    -1.2
        Junior college or more   Count         68     115     114      78      68     443
                                 Std. Residual -1.1    1.8     1.9     1.4    -3.7
Total                            Count        248     310     304     211     338    1411

Deriving the degrees of freedom from Table 2, df = (5-1)*(3-1) = 8.  None of the expected counts are less than five, since the minimum expected count is 36.3 (Table 3), which is desirable.  The chi-square value is 96.364 and is significant at the 0.05 level. Thus, the null hypothesis is rejected, and there is a statistically significant association between a person's age category and diploma level.  This test does not, however, say anything about the directionality of the relationship.

Table 3: Chi-Square Tests

                                Value      df   Asymptotic Significance (2-sided)
Pearson Chi-Square              96.364(a)   8   .000
Likelihood Ratio                90.580      8   .000
Linear-by-Linear Association    23.082      1   .000
N of Valid Cases                1411
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 36.34.

Table 4: Directional Measures

                                                                 Value   Asymptotic Std. Error(a)   Approximate T(b)   Approximate Sig.
Nominal by Nominal   Lambda                   Symmetric           .029   .013                       2.278              .023
                                              Degree Dependent    .000   .000                       (c)                (c)
                                              Age cat. Dependent  .048   .020                       2.278              .023
                     Goodman and Kruskal tau  Degree Dependent    .024   .005                                          .000(d)
                                              Age cat. Dependent  .019   .004                                          .000(d)
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Cannot be computed because the asymptotic standard error equals zero.
d. Based on chi-square approximation.

Although there is a statistically significant association between a person's age category and diploma level, the chi-square test does not show how strongly these variables are related to each other. The lambda value (given that the null hypothesis was rejected) is 0.029, which corresponds to a 2.9% relationship between the two variables; thus the relationship has a very weak effect (Table 4). With only about 2.9% of the variance accounted for, there is little of practical importance going on here.

Conclusions

There is a statistically significant association between a person's age category and diploma level.  According to the crosstabulation, the model significantly over-predicts the number of people with less than a high school diploma for the 30-59 age groups, as well as the number with a junior college degree or more for the 60-89 age group.  These large standardized residuals helped drive a large and statistically significant chi-square value. However, with a lambda of 0.029, only about 2.9% of the variance is accounted for, so there is little of practical importance going on here.

SPSS Code

CROSSTABS

  /TABLES=ndegree BY agecat

  /FORMAT=AVALUE TABLES

  /STATISTICS=CHISQ CC LAMBDA

  /CELLS=COUNT SRESID

  /COUNT ROUND CELL.
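As an optional cross-check outside SPSS (not part of the original workflow), a short Python sketch (numpy and scipy assumed to be available) can reproduce the chi-square statistic and standardized residuals from the observed counts in Table 2:

import numpy as np
from scipy import stats

# Observed counts from Table 2 (rows: degree level; columns: age category)
observed = np.array([
    [ 42,  33,  36,  20, 112],   # Less than high school
    [138, 162, 154, 113, 158],   # High school
    [ 68, 115, 114,  78,  68],   # Junior college or more
])

chi2, p, dof, expected = stats.chi2_contingency(observed)
std_residuals = (observed - expected) / np.sqrt(expected)

print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.4f}")   # should match Table 3 (~96.36, df = 8)
print(np.round(std_residuals, 1))   # cells beyond +/-1.96 differ significantly from expectation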


Quant: Statistical Significance

In quantitative research methodologies, meaningful results should be reported along with their statistical significance, confidence intervals, and effect sizes (Creswell, 2014). If the results from a statistical test have a low probability of occurring by chance (5% or 1% or less), the test is considered significant (Creswell, 2014; Field, 2014; Huck, 2011).  Low statistical significance thresholds help protect against Type I errors (Huck, 2011). Statistical significance tests can reflect the same effect yet produce different values (Field, 2014).  In large samples, small differences can show up as statistically significant, while in smaller samples large differences may be deemed insignificant (Field, 2014).  Statistically significant results allow the researcher to reject a null hypothesis but do not test the importance of the observations made (Huck, 2011).  Huck (2011) states that two main factors that can influence whether a result is statistically significant are the quality of the research question and the research design.

This is why Creswell (2014) also calls for reporting confidence intervals and effect sizes. Confidence intervals give a range of values that describes the uncertainty of the overall observation, and effect size describes the strength of the conclusions drawn from the observations (Creswell, 2014).  Huck (2011) suggests that after statistical significance is calculated and the researcher either rejects or fails to reject the null hypothesis, an effect size analysis should be conducted.  The effect size allows researchers to measure objectively the magnitude, or practical significance, of the research findings by looking at the differential impact of the variables (Huck, 2011; Field, 2014).  Field (2014) defines one way of measuring effect size through Cohen's d: d = (mean of group 1 - mean of group 2) / standard deviation.  There are multiple ways to choose the standard deviation in the denominator: the control group's standard deviation, either group's standard deviation, the population standard deviation, or a pooled standard deviation of the groups, which assumes independence between the groups (Field, 2014).   If d = 0.2 there is a small effect, d = 0.5 a moderate effect, and d = 0.8 or more a large effect (Field, 2014; Huck, 2011). This is why a statistical test can yield a statistically significant value even though further analysis with effect size shows that the statistically significant result does not explain much of the total relationship.
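As an illustration of the pooled-standard-deviation option, here is a small Python sketch (the two groups and their values are hypothetical, and numpy is assumed to be available):

import numpy as np

def cohens_d_pooled(group1, group2):
    # Cohen's d using the pooled standard deviation of two independent groups
    n1, n2 = len(group1), len(group2)
    v1, v2 = np.var(group1, ddof=1), np.var(group2, ddof=1)   # sample variances
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_sd

# Hypothetical scores for two independent groups
treatment = [23, 25, 28, 30, 27, 26]
control   = [21, 24, 22, 26, 23, 25]
print(round(cohens_d_pooled(treatment, control), 2))   # compare against the 0.2 / 0.5 / 0.8 benchmarks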

Resources

  • Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California: SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2011). Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.

Quant: ANOVA and Multiple Comparisons in SPSS

Introduction

The aim of this analysis is to look at the relationship between the dependent variable of the income level of respondents (rincdol) and the independent variable of their reported level of happiness (happy).  This independent variable has three levels.

From the SPSS outputs, the goals are to:

  • Use the ANOVA procedure to determine the overall conclusion.
  • Use the Bonferroni correction as a post-hoc analysis to determine the relationship between specific levels of happiness and income.

Hypothesis

  • Null: There is no overall difference in rincdol across levels of happy
  • Alternative: There is a real overall difference in rincdol across levels of happy
  • Null2: There are no differences between specific pairs of happy levels on rincdol
  • Alternative2: There are real differences between specific pairs of happy levels on rincdol

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to look at the relationships between the following variables: rincdol (Respondent's income; ranges recoded to midpoints) and happy (General Happiness). To conduct a parametric analysis, navigate to Analyze > Compare Means > One-Way ANOVA.  The variable rincdol was placed in the "Dependent List" box, and happy was placed in the "Factor" box.  Under the "Post Hoc" button, "Bonferroni" was selected in the "Equal Variances Assumed" section.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.). The following output was observed in the next two tables.

The relationship between rincdol and happy is plotted using the chart builder.  The chart builder code is shown in the code section, and the resulting image appears in the results section.

Results

Table 1: ANOVA

Respondent's income; ranges recoded to midpoints
                  Sum of Squares        df    Mean Square       F       Sig.
Between Groups    11009722680.000         2   5504861341.000    9.889   .000
Within Groups     499905585000.000      898   556687733.900
Total             510915307700.000      900

The ANOVA analysis (Table 1) shows that the overall test is statistically significant, such that the first null hypothesis is rejected at the 0.05 level. Thus, there is a statistically significant overall difference in rincdol across levels of happy.  However, the overall test does not say which means differ from which; the post-hoc comparisons below examine the differences between the means at the various levels.

Table 2: Multiple Comparisons

Dependent Variable:   Respondent's income; ranges recoded to midpoints
Bonferroni
(I) GENERAL HAPPINESS   (J) GENERAL HAPPINESS   Mean Difference (I-J)   Std. Error   Sig.    95% CI Lower   95% CI Upper
VERY HAPPY              PRETTY HAPPY                 4093.678           1744.832     .058        -91.26         8278.61
                        NOT TOO HAPPY               12808.643*          2912.527     .000       5823.02        19794.26
PRETTY HAPPY            VERY HAPPY                  -4093.678           1744.832     .058      -8278.61           91.26
                        NOT TOO HAPPY                8714.965*          2740.045     .005       2143.04        15286.89
NOT TOO HAPPY           VERY HAPPY                 -12808.643*          2912.527     .000     -19794.26        -5823.02
                        PRETTY HAPPY                -8714.965*          2740.045     .005     -15286.89        -2143.04
*. The mean difference is significant at the 0.05 level.

According to Table 2, the pairing of "Very Happy" and "Pretty Happy" fails to reject Null2 at the 0.05 level (p = .058). However, the other pairings, "Very Happy" versus "Not Too Happy" and "Pretty Happy" versus "Not Too Happy", do reject the Null2 hypothesis at the 0.05 level.  Thus, there are real differences for two of the three pairs.


Figure 1: Graphed means of General Happiness versus incomes.

The relationship between general happiness and income is positive (Figure 1): people with lower levels of general happiness tend to have lower recorded mean incomes, and vice versa.  However, no direction or causality can be inferred from this analysis; it cannot be said that higher income causes general happiness, or that happy people earn more money because of their positive attitude toward life.
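Outside of SPSS, the same two-step logic (an overall F-test followed by Bonferroni-corrected pairwise comparisons) can be sketched in Python; the three groups below are hypothetical stand-ins for the happiness categories, not the GSS data, and scipy is assumed to be available:

from itertools import combinations
from scipy import stats

# Hypothetical income values for three happiness groups (stand-ins for the GSS variables)
groups = {
    "VERY HAPPY":    [52000, 61000, 58000, 70000, 49000, 65000],
    "PRETTY HAPPY":  [48000, 55000, 51000, 60000, 46000, 58000],
    "NOT TOO HAPPY": [39000, 42000, 45000, 37000, 44000, 41000],
}

f, p = stats.f_oneway(*groups.values())      # overall one-way ANOVA
print(f"ANOVA: F = {f:.2f}, p = {p:.4f}")

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)                    # Bonferroni-adjusted alpha for 3 comparisons
for a, b in pairs:
    t, p_pair = stats.ttest_ind(groups[a], groups[b])
    verdict = "significant" if p_pair < alpha else "not significant"
    print(f"{a} vs {b}: p = {p_pair:.4f} ({verdict} at adjusted alpha = {alpha:.4f})")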

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

ONEWAY rincdol BY happy

  /MISSING ANALYSIS

  /POSTHOC=BONFERRONI ALPHA(0.05).

* Chart Builder.

GGRAPH

  /GRAPHDATASET NAME="graphdataset" VARIABLES=happy MEAN(rincdol)[name="MEAN_rincdol"]

    MISSING=LISTWISE REPORTMISSING=NO

  /GRAPHSPEC SOURCE=INLINE.

BEGIN GPL

  SOURCE: s=userSource(id("graphdataset"))

  DATA: happy=col(source(s), name("happy"), unit.category())

  DATA: MEAN_rincdol=col(source(s), name("MEAN_rincdol"))

  GUIDE: axis(dim(1), label("GENERAL HAPPINESS"))

  GUIDE: axis(dim(2), label("Mean Respondent's income; ranges recoded to midpoints"))

  SCALE: cat(dim(1), include("1", "2", "3"))

  SCALE: linear(dim(2), include(0))

  ELEMENT: line(position(happy*MEAN_rincdol), missing.wings())

END GPL.


Quant: Paired Sample Statistics in SPSS

Introduction

The aim of this analysis is to compare productivity under two organizational structures. The data are artificial estimates of productivity, with column 1 representing traditional vertical management and column 2 representing autonomous work teams (ATW). The background is that a company of 100 factory workers had been operating under traditional vertical management and decided to move to ATW. The same employees were involved in both systems, having first worked under vertical management before being converted to ATW.

From the SPSS outputs the goal is to:

  • Analyze the productivity levels of the two management approaches and decide which is superior.

Hypothesis

  • Null: There is no difference between prodpre and prodpost
  • Alternative: There is a real difference between prodpre and prodpost

Methodology

For this project, the atw.sav file is loaded into SPSS (ATW, n.d.).  The goal is to look at the relationships between the following variables: prodpre (productivity level preceding the new process) and prodpost (productivity level following the new process). To conduct a parametric analysis, navigate to Analyze > Compare Means > Paired-Samples T Test.  The variable prodpre was placed in the “Paired Variables” box under “Pair” 1 and “Variable 1”, and prodpost was placed under “Pair” 1 and “Variable 2”.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.). The following output was observed in the next three tables.

Results

Table 1: Paired Sample Statistics

                                                         Mean    N     Std. Deviation   Std. Error Mean
Pair 1   productivity level preceding the new process    76.43   100   16.820           1.682
         productivity level following the new process    84.24   100    9.797            .980

Descriptively, productivity increased by roughly 8 points on average, and the standard deviation about the mean decreased by roughly 7 points.  This means the productivity estimates under traditional vertical management are lower, and more widely spread, than those under the autonomous work teams.  Essentially, these distributions tell the story that workers obtain higher productivity estimates with less variation under autonomous work teams.

Table 2: Paired Samples Correlation

                                                                          N     Correlation   Sig.
Pair 1   productivity level preceding the new process &
         productivity level following the new process                     100   .040          .695

Based on Table 2, there is only a weak correlation (r = 0.040) between the productivity estimates under traditional vertical management and those under the autonomous work teams, and that correlation is not statistically significant (p = .695).  As always, correlation does not imply causation.

Table 3: Paired Samples Test

                                                       Paired Differences
                                                       Mean     Std. Deviation   Std. Error Mean   95% CI Lower   95% CI Upper   t        df   Sig. (2-tailed)
Pair 1   productivity level preceding the new process
         - productivity level following the new
         process                                       -7.817   19.126           1.913             -11.612        -4.022         -4.087   99   .000

Based on the results of the two-tailed paired-samples t-test (Table 3), the null hypothesis can be rejected: there is a significant difference between the two variables prodpre and prodpost at the 0.05 level.  The data from 100 workers (99 degrees of freedom) show a significant difference between the productivity estimates under traditional vertical management and those under the autonomous work teams.
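The same paired comparison can be sketched outside SPSS; the before/after scores below are hypothetical (the real atw.sav data are not reproduced here), and scipy is assumed to be available:

from scipy import stats

# Hypothetical before/after productivity scores for the same five workers
prodpre  = [70, 82, 64, 75, 80]
prodpost = [78, 85, 72, 83, 86]

t, p = stats.ttest_rel(prodpre, prodpost)   # paired-samples (dependent) t-test
print(f"t = {t:.3f}, p = {p:.4f}")
# A p-value below 0.05 would lead to rejecting the null hypothesis of no difference
# between the pre- and post-change productivity means.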

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

T-TEST PAIRS=prodpre WITH prodpost (PAIRED)

  /CRITERIA=CI(.9500)

  /MISSING=ANALYSIS.


Quant: Parametric and Non-Parametric Stats

Parametric statistics are inferential, based on random sampling from a well-defined population, and use the sample data to make strict inferences about the population's parameters; thus tests like t-tests, chi-square, and F-tests (ANOVA) can be used (Huck, 2011; Schumacker, 2014).  Nonparametric statistics, or "assumption-free tests," are used with ranked data, as in the Mann-Whitney U-test, Wilcoxon signed-rank test, Kruskal-Wallis H-test, and chi-square (Field, 2013; Huck, 2011).

First, there is a need to define the types of data.  Continuous data is interval/ratio data, and categorical data is nominal/ordinal data.  Modified from Schumacker (2014) with data added from Huck (2011):

Statistic                        Dependent Variable    Independent Variable
Analysis of Variance (ANOVA)
     One-way                     Continuous            Categorical
t-Tests
     Single sample               Continuous
     Independent groups          Continuous            Categorical
     Dependent (paired groups)   Continuous            Categorical
Chi-square                       Categorical           Categorical
Mann-Whitney U-test              Ordinal               Ordinal
Wilcoxon                         Ordinal               Ordinal
Kruskal-Wallis H-test            Ordinal               Ordinal

ANOVAs (or F-tests) are used to analyze the differences among three or more group means by studying the variation between the groups, and they test the null hypothesis that the group means are equal (Huck, 2011). Student t-tests, or t-tests, test the null hypothesis that a population mean equals some specified value and are used when the sample size is relatively small compared to the population size (Field, 2013; Huck, 2011; Schumacker, 2014).  The test assumes a normal distribution (Huck, 2011). With large sample sizes, t-tests/values approach z-tests/values, and the same happens with chi-square, since the t and chi-square distributions depend on sample size through their degrees of freedom (Schumacker, 2014).  In other words, at large sample sizes the t-distribution and chi-square distribution begin to look like a normal curve.  Chi-square is related to the variance of a sample, and chi-square tests can be used to test the null hypothesis that the sample comes from a normal distribution (Schumacker, 2014).  The chi-square test is versatile enough to be used as both a parametric and a nonparametric test (Field, 2013; Huck, 2011; Schumacker, 2014).
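The claim that the t-distribution approaches the normal curve at large sample sizes can be checked numerically; a minimal Python sketch (assuming scipy is available) compares two-tailed 0.05 critical values at increasing degrees of freedom:

from scipy import stats

# Two-tailed 0.05 critical values: t approaches the z value (about 1.96) as df grows
for df in (5, 30, 100, 1000):
    print(f"df = {df:>4}: t critical = {stats.t.ppf(0.975, df):.3f}")
print(f"normal (z) critical = {stats.norm.ppf(0.975):.3f}")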

The Mann-Whitney U-test and the Wilcoxon signed-rank test serve equivalent purposes, since they are the nonparametric counterparts of the t-tests, and the samples do not even have to be of the same length (Field, 2013).

The nonparametric Mann-Whitney U-test can be substituted for a t-test when a normal distribution cannot be assumed; it was designed for two independent samples that do not have repeated measures (Field, 2013; Huck, 2011). This makes it a great substitution for the independent-groups t-test (Field, 2013). A benefit of choosing the Mann-Whitney U-test is that it is less likely to produce a Type II error (a false negative) (Huck, 2011). The null hypothesis is that the two independent samples come from the same population (Field, 2013; Huck, 2011).

The nonparametric Wilcoxon signed-rank test is best for distributions that are skewed, where variance homogeneity cannot be assumed, and where a normal distribution cannot be assumed (Field, 2013; Huck, 2011).  The Wilcoxon signed-rank test can help compare two related/correlated samples from the same population (Huck, 2011), where each pair of data is chosen randomly and independently, without repetition between pairs (Huck, 2011).  This makes it a great substitution for the dependent t-test (Field, 2013; Huck, 2011).  The null hypothesis is that the central tendency of the differences is 0 (Huck, 2011).

The nonparametric Kruskal-Wallis H-test can be used to compare two or more independent samples from the same distribution, which makes it analogous to a one-way analysis of variance (ANOVA), and it focuses on central tendencies (Huck, 2011).  It is essentially an extension of the Mann-Whitney U-test (Huck, 2011). The null hypothesis is that the medians of all groups are equal (Huck, 2011).
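The three nonparametric tests above have direct counterparts in scipy; a short sketch (with hypothetical data, assuming scipy is available) shows how each would be called:

from scipy import stats

# Hypothetical samples; each test's null hypothesis is described in the text above
group_a = [12, 15, 14, 10, 18, 13, 16]
group_b = [22, 19, 24, 21, 17, 25, 20]
group_c = [30, 28, 27, 33, 29, 31, 26]

before = [5, 7, 6, 8, 9, 6, 7]
after  = [6, 9, 7, 9, 11, 8, 9]

u, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")  # two independent samples
w, p_w = stats.wilcoxon(before, after)                                  # two related samples
h, p_h = stats.kruskal(group_a, group_b, group_c)                       # three or more independent samples

print(f"Mann-Whitney U: U = {u:.1f}, p = {p_u:.4f}")
print(f"Wilcoxon signed-rank: W = {w:.1f}, p = {p_w:.4f}")
print(f"Kruskal-Wallis H: H = {h:.2f}, p = {p_h:.4f}")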

References

  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2011). Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Schumacker, R. E. (2014). Learning statistics using R. California: SAGE Publications, Inc. VitalBook file.