Adv Database Management: SQL Group functions

Please note that the following blog post provides a summary view for what you need to get done (left column) and quick examples that illustrate how to do it in SQL (right column with SQL code in red). For more information please see the resources below:

AVG, COUNT, MAX, MIN, STDDEV, SUM, VARIANCE
SELECT AVG(sal), MAX(sal), MIN(sal), SUM(sal)
  FROM emp
  WHERE jobs LIKE ‘Sales%’
COUNT
SELECT COUNT(*)
  FROM emp
  WHERE deptno = 30
Group By
SELECT deptno, AVG(sal)
  FROM emp
  GROUP BY deptno
Rollup and cube
SELECT   deptno, MAX(sal)
  FROM     emp
  GROUP BY deptno WITH ROLLUP [CUBE]
Having
SELECT   deptno, MAX(sal)
  FROM     emp
  GROUP BY deptno
  HAVING max(sal)>2900

 

Quant: Understanding Variance

If a researcher were to look at a measure of job performance resulting from 2 different manufacturing processes and found that the mean performance of process A was 82.5, and the mean performance of process B was 78.5, they could not automatically assume that process A will consistently outperform process B.  The reason the researchers cannot come to a conclusion until an analysis of variance done to that data.  There could be variance between the types of the statement of work that is uniquely different and are required between process A and process B (within-group variance), and there could be variances between the groups of people conducting the statement of work (between group variance).  These two types of variances will feed into the F-statistic result which would allow the researcher to state then whether or not they can reject the null hypothesis that the means between both mean performances are the same.

Quant: Central tendencies and variances

Central to quantitative research and methods is to understand the numerical, ordinal, or categorical dataset and what the data represents. This can be done through either descriptive statistics, where the researcher uses statistics to help describe a data set, or it can be done through inferential statistics, where conclusions can be drawn about the data set (Miller, n.d.).  However, researchers should aim to avoid situations where insights and conclusions are gathered and are drawn from the extreme or non-representative data, and understanding the central tendency can help avoid this scenario. For instance, in data mining for business and industry, current practice is comparing multiple random samples based on their central tendency (Ahlemeyer-Stubbe & Coleman, 2014).  Field (2013) and Schumacker (2014), defined central tendency as an all-encompassing term to help describe the “center of a frequency distribution” through the commonly used measures mean, median, and mode.

Central Tendency

In a symmetrical distribution, the central tendency is where most of the data values tend to occur. Thus we can use mean to help describe the central tendency (Schumacker, 2014).  The mean is the arithmetic average value of the data distribution, which is the sum of all the data values divided by the number of data points in the distribution (Field, 2013). Miller (n.d.) stated that the mean value is best when the data is interval data (equally distributed continuously), and the data is well balanced and not skewed.  However, if the data is heavily skewed, then the median is the best to use since it ignores the extreme values on both ends of the distribution (Miller, n.d.). Medians are easily calculated when the total number of data points in a distribution is an odd number, but if it is an even number of data points, the two most centered values will have to be averaged to obtain the median. Modes can help the researcher to identify patterns and is best for nominal or ordinal data (Miller, 2013). Therefore, the mode is the data value that is the most frequent among the data distribution, and a data distribution can be bimodal, which is having two modes in the distribution, or multi-modal, which is having three or more modes of the distribution (Field, 2013). The median is the data value at the center of the distribution when the data values are placed in ascending order (Field, 2013).

Example:  A random sample of fictitious Twitter user’s follower count consist of {22, 40, 57, 57, 68, 93, 116, 121, 168, 405, 2380, 8746}, the mean would be 1022.75, the mode would be 57, and the median would be 104.5.

Outliers, missing values, and multiplication of a constant, and adding a constant are some factors that can affect the central tendency (Schumacker, 2014).  In the case of outliers, it can draw the mean away from the central tendency and towards the outliers skewing the distribution (Miller, n.d.; Schumacker, 2014). The mean moves to the skew, or extreme values to the data, such as in the example above.  This is the danger in including the extreme values in the central tendency measures.  Then there is a need to know more about the central tendency of the data.  One way is to analyze the mean and median together to understand how skewed the data is and in which direction.  Heavily skewed distributions would heavily increase the distance between these two values, and if the mean less than the median the distribution is skewed negatively (Field, 2013).  To understand the distribution, better other measures like variance and standard deviations could be used.

Variance and Standard Deviation

Variance and standard deviations are considered as measures of dispersion, which talk about how the data values are spread across the data distribution and how they are different from the mean (Field, 2013; Schumacker, 2014).  The difference between the central tendency value and the data value is considered the deviance of the value (Field, 2013).  Squaring each deviance to get rid of negative signs and summing up all those deviances for each and every data value will result in obtaining a sum of squared error value (Field, 2013).   Taking the sum of squared errors value and dividing that up by the number of data values -1, will result in obtaining the variance of the distribution (Field, 2013).  Therefore, the variance is the average deviation between the central tendency and the data (Field, 2013).  Variance is the standard deviation value squared, and the variance is a measured value that is not in the same units of the data set (Field, 2013; Miller, n.d.; Schumacker, 2014).  In the above example, the variance about the mean is 6349719 with a standard deviation about the mean is 2519.865.  While removing the two extreme values the variance about the mean is 12293.34 and the standard deviation about the mean are 110.8754. This helps shows that the variability scores can change drastically with and without the extreme values.  Thus, this illustrates that variability shrinks when there is an agreement in the data (Miller, n.d.).

References:

  • Ahlemeyer-Stubbe, A., & Coleman S. (2014). A practical guide to data mining for business and industry. UK, Wiley-Blackwell. VitalBook file.
  • Field, A. (2013) Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Miller, R. (n.d.). Week 1: Central tendency [Video file]. Retrieved from http://breeze.careeredonline.com/p9fynztexn6/?launcher=false&fcsContent=true&pbMode=normal
  • Schumacker, R. E. (2014) Learning statistics using R. California, SAGE Publications, Inc, VitalBook file.