Adv Quant: Locally Weighted Scatterplot Smoothing (LOWESS) in R

Locally weighted scatterplot smoothing (LOWESS) is a non-parametric, k-nearest-neighbor-based regression method for models with one or more independent variables, and it produces a smoothed curve or surface (Field, 2013; Smith, 2015).  LOWESS aims not to impose a parametric model on the data, because doing so would require many more resources (Cleveland, 1979).  Non-parametric tests carry fewer assumptions than parametric tests; for example, they make no assumptions about the population distribution of the sampled variable (Field, 2013; Huck, 2013; Schumacker, 2014; Smith, 2015).

Assumptions in parametric analysis, which are based on the normal distribution, include (1) additivity and linearity; (2) normality; (3) homoscedasticity/homogeneity of variance; and (4) independence (Field, 2013; Huck, 2013).  The assumption of independence still applies in non-parametric analysis (Huck, 2013).  Smith (2015) states that non-parametric analyses are less powerful than parametric analyses.  Field (2013) disagrees and says that they are powerful, but admits that information about the magnitude of the differences between observed values is lost.  Huck (2013) states that non-parametric analyses, when used correctly, have power similar to that of parametric analyses on data that meet the parametric assumptions.  To conduct a non-parametric analysis, data values are ranked, so that higher values receive higher ranks and vice versa (Field, 2013; Huck, 2013; Smith, 2015).  Cleveland (1979) describes how only a fraction of the data (the local neighbors) is considered at a time, with each neighbor weighted by a weighting function.  Working with ranks and local neighborhoods in this way helps reduce the effects of outliers and irons out skewed distributions (Field, 2013; Smith, 2015).
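The local-neighbor idea can be sketched in a few lines of R.  This is only an illustration, not the internals of R's lowess(): the helper names tricube and local_fit are hypothetical, the tricube weight function is an assumption about the weighting used, and the robustness iterations that reduce the influence of outliers are left out.

```r
# Tricube weight: 1 at distance 0, falling to 0 at the edge of the neighborhood
tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Hypothetical helper: one locally weighted linear fit, evaluated at x0
local_fit <- function(x, y, x0, f = 2/3) {
  k <- max(2, floor(f * length(x)))   # number of neighbors implied by the fraction f
  d <- abs(x - x0)                    # distance of every point from x0
  h <- sort(d)[k]                     # bandwidth: distance to the k-th nearest neighbor
  w <- tricube(d / h)                 # nearer neighbors receive larger weights
  fit <- lm(y ~ x, weights = w)       # weighted least-squares line through the neighbors
  unname(predict(fit, newdata = data.frame(x = x0)))
}

x <- seq(0, 1, length.out = 50)
y <- 2 + 3 * x                        # noise-free line, so the local fit should recover it
local_fit(x, y, x0 = 0.5)             # close to 3.5, because the data lie exactly on y = 2 + 3x
```

Repeating this fit at every x value and joining the fitted points gives the smoothed curve; a smaller f uses fewer neighbors per fit.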

Advantages and disadvantages

+ LOWESS doesn’t depend on an underlying population distribution (Field, 2013; Huck, 2013; Schumacker, 2014; Smith, 2015)

+ Examining each point’s local neighbors creates a smoothing function, which visually enhances patterns in the data (Cleveland, 1979)

– The LOWESS technique is not a substitute for parametric regression analysis (Huck, 2013).  Thus, to use non-parametric tests, one must reject the null hypothesis that the data follow a defined distribution in favor of the alternative hypothesis that the data do not follow a defined distribution (Field, 2013; Huck, 2013).

– LOWESS is computationally heavy, especially depending on the weights chosen (Cleveland, 1979).

– Although the regression is easily represented visually as a smoothed curve, the regression formula may not be as cleanly written (Cleveland, 1979).

Multiple Regression Analysis

From the Rdatasets archive website (http://vincentarelbundock.github.io/Rdatasets/), the NOxEmissions.csv file was downloaded.  This is the NOx Air Pollution Data, and it has five variables: a primary key, Julian calendar day (julday), hourly mean of NOx concentrations in the air in parts per billion (LNOx), hourly sum of NOx car emissions in parts per billion (LNOxEm), and square root of the wind speed in meters per second (sqrtWS).

From this dataset, it is hypothesized that the wind speed, combined with the sum of NOx from car emissions, could contribute to the mean NOx concentrations in the atmosphere.  Because there are multiple independent variables for one dependent variable, multiple regression analysis is best suited (Field, 2013; Huck, 2013; Schumacker, 2014; Smith, 2015).
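As a rough illustration of why two independent variables call for multiple regression, the sketch below compares single-predictor and two-predictor fits.  It uses simulated data, not the NOx file: x1 and x2 are stand-ins for LNOxEm and sqrtWS, and the coefficients and noise level are arbitrary assumptions.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)                                   # stand-in for LNOxEm (simulated)
x2 <- rnorm(n)                                   # stand-in for sqrtWS (simulated)
y  <- 1 + 0.6 * x1 - 1.0 * x2 + rnorm(n, sd = 0.5)
sim <- data.frame(y, x1, x2)

# Hypothetical helper: R-squared of a fitted linear model
r2 <- function(model) summary(model)$r.squared

r2_x1   <- r2(lm(y ~ x1, data = sim))            # first predictor alone
r2_x2   <- r2(lm(y ~ x2, data = sim))            # second predictor alone
r2_both <- r2(lm(y ~ x1 + x2, data = sim))       # both predictors together
c(x1 = r2_x1, x2 = r2_x2, both = r2_both)
```

For least-squares fits with an intercept, adding a predictor never lowers R-squared, so the two-predictor model explains at least as much variation as either single-predictor model.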


Figure 1: Histogram of each of the variables in the data set.

Figure 2: Simple linear regression between each of the independent variables and the dependent variable.  For the image on the right, the regression formula is LNOx = -0.86443(sqrtWS) + 5.55885, with a correlation of -0.4300, and for the image on the left, the regression formula is LNOx = 0.590066(LNOxEm) + 0.048645, with a correlation of 0.6399.


Figure 3: The summary output of the multiple linear regression, where the regression formula is LNOx = -1.018198(sqrtWS) + 0.641412(LNOxEm) + 1.061950, which explains 66.3% of the variation between these variables.


Figure 4: Normal quantile-quantile plot for the multiple linear regression described in Figure 3.

The histograms (Figure 1) do not make a convincing case that the data can be tested with a normal multiple linear regression analysis, but the normal quantile-quantile plot (Figure 4) shows normality in the data, justifying the results (Figure 3).  To further the understanding of the multiple linear regression, the simple linear regression per independent variable (Figure 2) shows that neither independent variable alone explains the variance between the variables as well as the multiple regression analysis does.
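The quantile-quantile check can also be summarized numerically.  The sketch below is an assumption-laden illustration on simulated data (not the NOx file): it correlates the theoretical and sample quantiles of the residuals, which should be close to 1 when the points hug the reference line.  The name qq_cor is hypothetical.

```r
set.seed(2)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.6 * x1 - 1.0 * x2 + rnorm(n, sd = 0.5)   # simulated, normally distributed errors
model <- lm(y ~ x1 + x2)

res <- resid(model)
qq  <- qqnorm(res, plot.it = FALSE)   # theoretical quantiles in qq$x, sample quantiles in qq$y
qq_cor <- cor(qq$x, qq$y)             # near 1 when the residuals look normal
qq_cor

# To draw the plot itself: qqnorm(res); qqline(res)
```

A value well below 1, or visible curvature in the plot, would suggest that the normality assumption deserves a closer look.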


Figure 5: Multiple LOWESS regression plot with varying smoothing span.

Even though the data show normality, a LOWESS curve was still plotted to illustrate how different smoothing factors influence the result.  The smoothing factor sets the size of the k-nearest-neighbor neighborhood (Cleveland, 1979).  The smaller the smoothing factor, the smaller the neighborhood; the larger the smoothing factor, the bigger the neighborhood, which oversimplifies the result.  The blue line (f=2/3) uses the default value in R (R, n.d.e).
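The effect of the smoothing factor can be sketched on simulated data.  The noisy sine curve, the f values, and the roughness measure (sum of squared second differences of the fitted values) are all illustrative assumptions, not part of the NOx analysis.

```r
set.seed(3)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)          # simulated noisy curve

fit_small   <- lowess(x, y, f = 0.05)       # small neighborhood
fit_default <- lowess(x, y, f = 2/3)        # R's default span
fit_large   <- lowess(x, y, f = 1)          # every point is a neighbor

# Hypothetical roughness measure: larger values mean a wigglier fitted curve
roughness <- function(fit) sum(diff(fit$y, differences = 2)^2)

rough <- c(small   = roughness(fit_small),
           default = roughness(fit_default),
           large   = roughness(fit_large))
rough
```

A small f follows the noise closely and gives a rough curve, while a large f flattens local features; plotting the three fits with lines() shows the same contrast visually.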

Code

NOxData = read.csv(file="https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/robustbase/NOxEmissions.csv", header = TRUE, sep = ",")

head(NOxData)

hist(NOxData$LNOx, freq=FALSE, xlab = "hourly mean of NOx concentrations [ppb]",  main = "Histogram of the hourly mean of NOx concentrations")

hist(NOxData$LNOxEm, freq=FALSE, xlab = "hourly sum of NOx car emissions [ppb]",  main = "Histogram of the hourly sum of NOx car emissions")

hist(NOxData$sqrtWS, freq=FALSE, xlab = "square root of winds [m/s]", main = "Histogram of the square root of winds")

# Simple linear regressions of LNOx on each independent variable

## LNOxEm

plot(NOxData$LNOxEm, NOxData$LNOx)

abline(lm(NOxData$LNOx~NOxData$LNOxEm), col="red")

summary(lm(NOxData$LNOx~NOxData$LNOxEm))

cor(NOxData$LNOx,NOxData$LNOxEm)

## sqrtWS

plot(NOxData$sqrtWS, NOxData$LNOx)

abline(lm(NOxData$LNOx~NOxData$sqrtWS), col="red")

summary(lm(NOxData$LNOx~NOxData$sqrtWS))

cor(NOxData$LNOx,NOxData$sqrtWS)

# Multiple linear regression of LNOx on both the LNOxEm and sqrtWS variables

RegressionModel = lm(NOxData$LNOx~ NOxData$LNOxEm + NOxData$sqrtWS)

summary(RegressionModel)     

plot(RegressionModel)

# Pearson’s Correlation between independent variables

cor(NOxData$LNOxEm, NOxData$sqrtWS)

# 95% Confidence Intervals on the regression model

confint(RegressionModel, level=0.95)

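# LOWESS MODEL
# Note: lowess() smooths a single x against y.  Given a formula like the ones below,
# the right-hand side appears to be evaluated arithmetically (via xy.coords()),
# so x is the element-wise sum LNOxEm + sqrtWS rather than two separate predictors.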

LowessModel = lowess(NOxData$LNOx~ NOxData$LNOxEm + NOxData$sqrtWS, f=2/3)

LowessModel2 = lowess(NOxData$LNOx~ NOxData$LNOxEm + NOxData$sqrtWS, f=0.01)

LowessModel3 = lowess(NOxData$LNOx~ NOxData$LNOxEm + NOxData$sqrtWS, f=1)

plot(LowessModel, type="l", col="blue", main="LOWESS Regression: green is f=1, blue is f=2/3, & red is f=0.01")

lines(LowessModel2, col="red")

lines(LowessModel3, col="green")
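For a local regression surface over two separate predictors, one option is base R's loess(), which accepts a formula with several terms.  This is a different function from lowess(), and the sketch below is only an illustration on simulated data: the variables x1 and x2, the span, and the degree are assumptions, not settings taken from the analysis above.

```r
set.seed(4)
n  <- 300
x1 <- runif(n)                                   # simulated predictor 1
x2 <- runif(n)                                   # simulated predictor 2
y  <- sin(2 * pi * x1) + x2^2 + rnorm(n, sd = 0.2)
sim <- data.frame(y, x1, x2)

# Local linear surface over both predictors; span plays the role of lowess()'s f
surface <- loess(y ~ x1 + x2, data = sim, span = 2/3, degree = 1)

# Fitted value at one interior point of the predictor space
pred <- predict(surface, newdata = data.frame(x1 = 0.5, x2 = 0.5))
pred
```

Refitting with a smaller or larger span changes the neighborhood size in the same way that f does for lowess().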

References