Regression formulas are useful for summarizing the relationship between the variables in question (Huck, 2011). There are multiple types of regression, and all of them are tests of prediction (Huck, 2011; Schumacker, 2014). Least squares (linear) regression is the most well known because it uses basic algebra, a straight line, and the correlation coefficient to describe the strength of the regression's prediction (Huck, 2011; Schumacker, 2014). The linear regression model is:
y = (a + bx) + e (1)
where y is the dependent variable, x is the independent variable, a (the intercept) and b (the regression weight, also known as the slope) are constants estimated through the regression analysis, and e is the regression prediction error (Field, 2013; Schumacker, 2014). Under the least squares criterion, a and b in equation 1 are chosen so that the sum of the squared errors is minimized (Schumacker, 2014).
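As a minimal sketch (not drawn from the cited sources), the least squares estimates of b and a can be computed directly from their closed-form expressions and checked against R's lm( ) function; the simulated x and y below are assumptions used only for illustration:
> set.seed(1) # hypothetical data for illustration only
> x <- rnorm(50, mean = 10, sd = 2) # independent variable
> y <- 3 + 0.5 * x + rnorm(50, sd = 1) # dependent variable with random error e
> b <- cov(x, y) / var(x) # slope that minimizes the sum of squared errors
> a <- mean(y) - b * mean(x) # corresponding intercept
> c(intercept = a, slope = b) # hand-computed least squares estimates
> coef(lm(y ~ x)) # lm( ) returns the same intercept and slope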
Correlation coefficients help define the strength of the relationship that the regression formula summarizes, and they range in value from -1 to +1. The closer the correlation coefficient is to -1 or +1, the better the regression formula predicts one variable from the other; the closer it is to zero, the weaker the relationship between the variables (Field, 2013; Huck, 2011; Schumacker, 2014). Correlation never implies causation, but squaring the correlation coefficient (r²) gives the proportion of variance in the dependent variable that is accounted for by the regression formula (Field, 2013).
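As a quick check (a sketch with hypothetical data, not taken from the cited sources), the squared correlation between x and y equals the R-squared that summary( ) reports for the fitted regression:
> set.seed(2) # hypothetical data for illustration only
> x <- rnorm(40); y <- 2 * x + rnorm(40) # simulated linear relationship
> cor(x, y)^2 # proportion of variance shared by x and y
> summary(lm(y ~ x))$r.squared # matches the squared correlation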
Assumptions of the General Least Squares Model (GLM) for regression and correlation
The General Least Squares Model (GLM) produces the line of best fit for linear regression modeling and its corresponding correlations (Smith, 2015). There are five assumptions in a linear regression model: additivity, linearity, independent errors, homoscedasticity, and normally distributed errors. The dependent variable should be linearly related to the independent variable(s), and the combined effects of multiple independent variables should be additive. A residual is the difference between an observed value and the predicted value: (1) no two residuals should be correlated, which can be tested numerically with the Durbin-Watson test; (2) the variance of the residuals should be constant across the independent variable(s); and (3) the residuals should be random and normally distributed with a mean of 0 (Field, 2013; Schumacker, 2014).
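A minimal sketch of how these residual assumptions can be inspected in R (an illustration assumed here, not a procedure given in the cited sources); it assumes the car package is installed for the Durbin-Watson test and uses a hypothetical fitted model on a data set named dataSet:
> myRegression <- lm(y ~ x, data = dataSet) # hypothetical fitted model
> plot(myRegression) # diagnostic plots: residuals vs. fitted (linearity, homoscedasticity) and normal Q-Q
> car::durbinWatsonTest(myRegression) # Durbin-Watson test for correlated (non-independent) residuals
> shapiro.test(residuals(myRegression)) # formal test that the residuals are normally distributed
> mean(residuals(myRegression)) # should be effectively 0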
Issues with transforming variables to make them linear
When scatter plots of the data show that the linearity and additivity assumptions cannot be met, transformations can be applied to the variables to make the relationship linear. This is an iterative, trial-and-error process. The transformation must be applied to every score in the data set being corrected, because transforming changes the units of measurement and therefore the differences between the variables (Field, 2013). Table 1 summarizes the common transformations, and a short R sketch of applying them follows the table.
Table 1: Types of data transformations and their uses (adapted from Field, 2013, Table 5.1).

| Data transformation | Can correct for |
|---|---|
| Log [independent variable(s)] | Positive skew, positive kurtosis, unequal variances, lack of linearity |
| Square root [independent variable(s)] | Positive skew, positive kurtosis, unequal variances, lack of linearity |
| Reciprocal [independent variable(s)] | Positive skew, positive kurtosis, unequal variances |
| Reverse score [independent variable(s)]: subtract each score from the highest score in the variable | Negative skew |
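A brief sketch (hypothetical, not taken from Field, 2013) of how these transformations might be applied to a skewed variable x in R; the constant added before the log and reciprocal is an assumption of this example to avoid zeros:
> x_log <- log(x + 1) # log transformation (add 1 so zero values remain defined)
> x_sqrt <- sqrt(x) # square root transformation
> x_recip <- 1 / (x + 1) # reciprocal transformation (add 1 to avoid dividing by zero)
> x_rev <- max(x) - x # reverse score: subtract each score from the highest score (for negative skew)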
R procedures for linear regression
In R, lm( ) is the function for running a linear regression, glm( ) fits generalized linear models such as logistic regression (and should not be confused with the GLM above), and loglm( ) in the MASS package runs log-linear models (Schumacker, 2014; Smith, 2015). The summary( ) function outputs the results of the linear regression. In a model formula, the dependent variable is separated from the independent variable(s) by a tilde "~", and additional independent variables are added with "+" (Schumacker, 2014). Thus, the R procedures for linear regression are (Marin, 2013):
> cor(dataSet$x, dataSet$y) # correlation coefficient between x and y
> myRegression <- lm(y ~ x, data = dataSet) # conduct a linear regression of y on x
> summary(myRegression) # outputs the results of the lm( ) fit (coefficients, R-squared, F-test)
> attributes(myRegression) # lists the components stored in the lm( ) object
> myRegression$coefficients # gives the intercept and slope coefficients
> plot(dataSet$x, dataSet$y, main = "Title of graph") # scatter plot
> abline(myRegression) # add the fitted regression line to the plot
> confint(myRegression, level = 0.99) # 99% confidence intervals for the regression coefficients
> anova(myRegression) # ANOVA table for the regression
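Applying these procedures end to end on a built-in data set (mtcars, chosen here only as a convenient illustration and not referenced by the sources) might look like this:
> cor(mtcars$wt, mtcars$mpg) # correlation between car weight and fuel economy
> mpgModel <- lm(mpg ~ wt, data = mtcars) # regress miles per gallon on weight
> summary(mpgModel) # coefficients, R-squared, and overall F-test
> mpgModel$coefficients # intercept and slope
> plot(mtcars$wt, mtcars$mpg, main = "MPG versus weight") # scatter plot
> abline(mpgModel) # add the fitted regression line
> confint(mpgModel, level = 0.99) # 99% confidence intervals for the coefficients
> anova(mpgModel) # ANOVA table for the regression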
References
- Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
- Huck, S. W. (2011). Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
- Marin, M. (2013). Linear regression in R (R tutorial 5.1) [Video file]. Retrieved from https://www.youtube.com/watch?v=66z_MRwtFJM
- Schumacker, R. E. (2014). Learning Statistics Using R. California: SAGE Publications, Inc. VitalBook file.
- Smith, M. (2015). Statistical Analysis Handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html