Adv Quant: Logistic Vs Linear Regression

To generalize insights gained from a sample of data to a broader population, researchers must use the correct mathematical procedures for handling probabilities and information, namely statistical inference (Gall et al., 2006; Smith, 2015).  Gall et al. (2006) stated that statistical inference dictates the order of procedures; for instance, the hypothesis and null hypothesis must be defined before the statistical significance level, which in turn must be set before a z or t statistic is calculated. Essentially, statistical inference allows quantitative researchers to make inferences about a population, and researchers must remember where and how the sample data were generated and collected during the quantitative research process.  This order of procedures matters when applying statistical inference to regression; otherwise the resulting prediction formula will not be generalizable.

Logistic regression is another flavor of multi-variable regression, in which one or more continuous or categorical independent variables are used to predict a dichotomous (binary) or categorical dependent variable (Ahlemeyer-Stubbe & Coleman, 2014; Field, 2013; Gall, Gall, & Borg, 2006; Huck, 2011).  Logistic regression is an alternative to linear regression, which assumes that all variables are continuous (Ahlemeyer-Stubbe & Coleman, 2014). Both multi-variable linear regression and logistic regression build on the same linear predictor (Field, 2013; Schumacker, 2014):

Y = a + b1X1 + b2X2 + …                                                       (1)

The main difference between the two regressions is that the variables in equation (1) represent different types of dependent (Y) and independent (Xi) variables.  These variables may have to undergo a transformation before the regression analysis begins; in logistic regression, for example, equation (1) predicts the log-odds (logit) of the dichotomous outcome rather than the outcome itself (Field, 2013; Schumacker, 2014).  Because the variable types differ between logistic and linear regression, the assumptions on when to use either regression are also different (Table 1); a brief syntax sketch follows the table.

Table 1: Assumptions and variable types for logistic and linear regression, compiled from Ahlemeyer-Stubbe & Coleman (2014), Field (2013), Gall et al. (2006), Huck (2011), and Schumacker (2014).

Assumptions of Logistic Regression

·         Multicollinearity should be minimized between the independent variables
·         There is no need for linearity between the dependent and independent variables
·         Normality is needed only for the continuous independent variables
·         No need for homogeneity of variance within the categorical variables
·         Error terms are not assumed to be normally distributed
·         Independent variables do not have to be continuous
·         There are no missing data (no null values)
·         Variance that is not zero

Assumptions for Linear Regression

·         Multicollinearity should be minimized between the multiple independent variables
·         Linearity exists between all variables
·         Additivity (for multi-variable linear regression)
·         Errors in the dependent variable and its predicted values are independent and uncorrelated
·         All variables are continuous
·         Normality of all variables
·         Normality of the error values
·         Homogeneity of variance
·         Homoscedasticity: the variance of the residuals is constant
·         Variance that is not zero

Variable Types of Logistic Regression

·         2 or more independent variables
·         Independent variables: continuous, dichotomous, binary, or categorical
·         Dependent variable: dichotomous or binary

Variable Types of Linear Regression

·         1 or more independent variables
·         Independent variables: continuous
·         Dependent variable: continuous
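
To make the contrast concrete, the two analyses can be requested with SPSS syntax along the following lines. This is a minimal sketch assuming a hypothetical dataset with a binary outcome purchase, a continuous outcome spend, a continuous predictor income, and a categorical predictor region; these variable names are illustrative only and do not come from the sources above.

* Binary logistic regression: continuous and categorical predictors, dichotomous outcome (hypothetical variables).
LOGISTIC REGRESSION VARIABLES purchase
  /METHOD=ENTER income region
  /CATEGORICAL=region
  /PRINT=CI(95).

* Linear regression: continuous predictor, continuous outcome (hypothetical variables).
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT spend
  /METHOD=ENTER income.

In the logistic output, the B coefficients apply to the log-odds (logit) of the outcome, which is the transformation of the dependent variable mentioned above.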

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry (1st ed.). [VitalSource Bookshelf Online].
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). [VitalSource Bookshelf Online].
  • Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational Research: An Introduction (8th ed.). [VitalSource Bookshelf Online].
  • Huck, S. W. (2011). Reading Statistics and Research (6th ed.). [VitalSource Bookshelf Online].
  • Schumacker, R. E. (2014). Learning Statistics Using R (1st ed.). [VitalSource Bookshelf Online].
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html

Quant: Linear Regression in SPSS

Introduction

The aim of this analysis is to examine the relationship between a father’s education level (dependent variable) and the mother’s education level (independent variable); the variable names are “paeduc” and “maeduc,” respectively. Thus, the goal is to determine the linear regression equation for predicting the father’s education level from the mother’s education level.

From the SPSS outputs, the following questions will be addressed:

  • How much of the total variance have you accounted for with the equation?
  • Based upon your equation, what level of education would you predict for the father when the mother has 16 years of education?

Methodology

For this project, the gss.sav file is loaded into SPSS (GSS, n.d.).  The goal is to examine the relationship between the following variables: paeduc (HIGHEST YEAR SCHOOL COMPLETED, FATHER) and maeduc (HIGHEST YEAR SCHOOL COMPLETED, MOTHER). To conduct a linear regression analysis, navigate to Analyze > Regression > Linear.  The variable paeduc was placed in the “Dependent” box, and maeduc was placed in the “Independent(s)” box.  The procedures for this analysis are provided in video tutorial form by Miller (n.d.).  The resulting output is shown in the following four tables.

The relationship between paeduc and maeduc is also plotted in a scatterplot.  The syntax used to generate the plot is shown in the SPSS Code section, and the resulting image is shown in the Results section.

Results

Table 1: Variables Entered/Removed

Model Variables Entered Variables Removed Method
1 HIGHEST YEAR SCHOOL COMPLETED, MOTHERb . Enter
a. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. All requested variables entered.

Table 1 reports that, for the linear regression analysis, the dependent variable is the highest year of school completed by the father and the independent variable is the highest year of school completed by the mother.  No variables were removed.

Table 2: Model Summary

Model R R Square Adjusted R Square Std. Error of the Estimate
1 .639a .408 .407 3.162
a. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, MOTHER
b. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER

For a linear regression predicting the father’s highest year of school completed from the mother’s highest year of school completed, the correlation is positive with a value of 0.639, which means only 0.408 of the variance is explained (Table 2) and 0.592 of the variance is unexplained.  The linear regression formula, or line of best fit (Table 4), is: y = 2.572 years + 0.760x + e.  The line of best fit expresses in equation form the mathematical relationship between the two variables, in this case the father’s and mother’s highest education levels.  Thus, if the mother has completed her bachelor’s degree (16th year), the equation yields y = 2.572 years + 0.760(16 years) + e = 14.732 years + e.  The e is the error in this prediction formula, and it exists because the correlation is not exactly -1.0 or +1.0 (that is, r2 is less than 1).  The ANOVA table (Table 3) shows that the relationship between these two variables is statistically significant at the 0.05 level.  A brief syntax sketch for computing this predicted value follows Table 4.

Table 3: ANOVA Table

Model Sum of Squares df Mean Square F Sig.
1 Regression 6231.521 1 6231.521 623.457 .000b
Residual 9045.579 905 9.995
Total 15277.100 906
a. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER
b. Predictors: (Constant), HIGHEST YEAR SCHOOL COMPLETED, MOTHER

Table 4: Coefficients

Model Unstandardized Coefficients Standardized Coefficients t Sig.
B Std. Error Beta
1 (Constant) 2.572 .367 7.009 .000
HIGHEST YEAR SCHOOL COMPLETED, MOTHER .760 .030 .639 24.969 .000
a. Dependent Variable: HIGHEST YEAR SCHOOL COMPLETED, FATHER
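
Using the unstandardized coefficients in Table 4, the prediction can also be generated directly with SPSS syntax. The following is a minimal sketch, assuming the gss.sav dataset is still active; the variable name pred_paeduc is introduced here purely for illustration.

* Apply the fitted equation from Table 4 to every case (pred_paeduc is a hypothetical helper variable).
COMPUTE pred_paeduc = 2.572 + 0.760 * maeduc.
VARIABLE LABELS pred_paeduc 'Predicted highest year of school completed, father'.
EXECUTE.

For a mother with 16 years of education, this evaluates to 2.572 + 0.760 * 16 = 14.732 years, matching the hand calculation above.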

The image below (Figure 1) is a scatterplot of the highest year of school completed by the mother versus that of the father, along with the linear regression line (Table 4) and a box plot of each respective distribution.  There are more outliers in the father’s education level than in the mother’s, and the father’s education level is more tightly concentrated about its median.

[Image: u4db1f1.png]

Figure 1: Highest year of school completed by the mother vs the father scatter plot with regression line and box plot images of each respective distribution.

Conclusion

There is a statistically significant relationship between the father’s and mother’s highest year of education completed.  The line-of-best-fit formula shows a moderately positive correlation and is defined as y = 2.572 years + 0.760x + e, which explains only 40.8% of the variance, while 59.2% of the variance remains unexplained.

SPSS Code

DATASET NAME DataSet1 WINDOW=FRONT.

* Simple linear regression: predict paeduc (father's education) from maeduc (mother's education).
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT paeduc
  /METHOD=ENTER maeduc
  /CASEWISE PLOT(ZRESID) OUTLIERS(3).

* Scatterplot of paeduc vs. maeduc with fitted regression line and marginal box plots (Figure 1).
STATS REGRESS PLOT YVARS=paeduc XVARS=maeduc
  /OPTIONS CATEGORICAL=BARS GROUP=1 BOXPLOTS INDENT=15 YSCALE=75
  /FITLINES LINEAR APPLYTO=TOTAL.

References:

Quant: Regression and Correlations

Through a regression analysis, it should be possible to predict potential productivity based upon years of service, provided two conditions hold: (1) the productivity assessment tool is valid and reliable (Creswell, 2014), and (2) the sample size is large enough to conduct the analysis and draw statistical inferences about the population from the sample data that have been collected (Huck, 2011). Assuming these two conditions are met, a regression analysis of the data can produce a prediction formula. Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011). There are multiple types of regression, all of which are tests of prediction: linear, multiple, log-linear, quadratic, cubic, etc. (Huck, 2011; Schumacker, 2014).  Linear regression is the most well-known because it uses basic algebra, a straight line, and the Pearson correlation coefficient to express the regression’s prediction strength (Huck, 2011; Schumacker, 2014).  The linear regression formula is y = a + bx + e, where y is the dependent variable (in this case the productivity measure), x is the independent variable (years of service), a (the intercept) and b (the regression weight) are constants defined through the regression analysis, and e is the regression prediction error (Field, 2013; Schumacker, 2014).  The sum of the errors should equal zero (Schumacker, 2014).  A syntax sketch for fitting this model is shown below.
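
As a minimal illustration of that formula, the model could be fitted in SPSS with syntax such as the following; the variable names productivity and years_service are hypothetical and assume the productivity data have already been loaded into the active dataset.

* Fit y = a + b*x + e with the productivity measure as y and years of service as x (hypothetical variables).
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT productivity
  /METHOD=ENTER years_service.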

Linear regression models try to describe the relationship between one dependent and one independent variable, both measured at the ratio or interval level (Schumacker, 2014).  However, other regression models can be tested to find the best fit over the data.  Even though these are different regression tests, the goal of each model is the same: to describe the current relationship between the dependent variable and the independent variable(s) and to support prediction.  Multiple regression is used when there are multiple independent variables (Huck, 2011; Schumacker, 2014). Log-linear regression uses categorical or continuous independent variables (Schumacker, 2014). Quadratic and cubic regressions use a quadratic or cubic formula to help predict trends that are quadratic or cubic in nature, respectively (Field, 2013).  When modeling potential productivity based upon years of service, the regression with the strongest correlation will be used, as that regression formula best explains the variance between the variables.   However, even when a regression formula explains some or most of the variance between the variables, it never implies causation (Field, 2013).  A brief sketch of a quadratic fit is shown below.
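
To illustrate one of the non-linear options, a quadratic term can be added to the same sketch by computing a squared predictor; the variable names remain hypothetical.

* Quadratic regression: add a squared years-of-service term to capture curvature (hypothetical variables).
COMPUTE years_service_sq = years_service ** 2.
EXECUTE.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT productivity
  /METHOD=ENTER years_service years_service_sq.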

Correlations help define the strength of the regression formula in describing the relationships between the variables, and they can range in value from -1 to +1.  The closer the correlation coefficient is to -1 or +1, the better the regression formula predicts the variance between the variables; the closer the coefficient is to zero, the weaker the relationship between the variables (Field, 2013; Huck, 2011; Schumacker, 2014).  A negative correlation could show that as the years of service increase, the measured productivity decreases, which could be caused by apathy or some other factor that has yet to be measured.  A positive correlation could show that as the years of service increase, the measured productivity also increases, which could likewise be influenced by other factors not directly related to the years of service.  Thus, correlation does not imply causation, but when the correlation value is squared (r2), it gives the percentage of the variance between the variables explained by the regression formula (Field, 2013).  A correlation syntax sketch is shown below.
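
A Pearson correlation between the two hypothetical variables could be requested with syntax such as the following; squaring the reported coefficient gives the r2 discussed above.

* Pearson correlation between productivity and years of service (hypothetical variables); square r to obtain r-squared.
CORRELATIONS
  /VARIABLES=productivity years_service
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.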

References

  • Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California: SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2011). Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Schumacker, R. E. (2014). Learning statistics using R. California: SAGE Publications, Inc. VitalBook file.