Adv DBs: A possible future project?

Below is a possible future research paper on a database-related subject.

Title: Using MapReduce to analyze clinical test utilization patterns in medicine

The motivation:

Efficient processing and analysis of clinical data could lead to better use of clinical tests on patients. MapReduce offers an integrated solution for the medical field that saves the resources otherwise spent moving data in and out of storage.

The problem statement (symptom and root cause)

Rates of sexually transmitted infections (STIs) are increasing at an alarming pace. Could loading the test utilization patterns of the Roper Saint Francis Clinical Network in the South into Hadoop with MapReduce reveal patterns in the current STI population and predict areas where an outbreak may be imminent?

The hypothesis statement (propose a solution and address the root cause)

H0: Data mining in Hadoop with MapReduce will not be able to identify any meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

H1: Data mining in Hadoop with MapReduce can identify a meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.
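To make the hypotheses more concrete, here is a minimal sketch of the kind of MapReduce-style aggregation they refer to: counting tests ordered and the share of positive results per area and week. Everything in it is an assumption for illustration only. The record layout, the area labels, and the function names (`map_record`, `reduce_counts`, `run_job`) are hypothetical, the data is made up, and the job runs in a single Python process rather than on a Hadoop cluster.

```python
from collections import defaultdict

# Hypothetical test records: (area, week, result). Not real clinical data.
records = [
    ("area_A", "W01", "positive"),
    ("area_A", "W01", "negative"),
    ("area_A", "W02", "positive"),
    ("area_A", "W02", "positive"),
    ("area_B", "W01", "negative"),
]


def map_record(record):
    """Map step: emit ((area, week), (tests_ordered, positives)) per record."""
    area, week, result = record
    yield (area, week), (1, 1 if result == "positive" else 0)


def reduce_counts(key, values):
    """Reduce step: total the tests for one key and compute the positive share."""
    ordered = sum(v[0] for v in values)
    positives = sum(v[1] for v in values)
    return key, ordered, positives / ordered


def run_job(input_records):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    shuffled = defaultdict(list)
    for record in input_records:
        for key, value in map_record(record):
            shuffled[key].append(value)
    return [reduce_counts(key, values) for key, values in sorted(shuffled.items())]


for (area, week), ordered, positive_rate in run_job(records):
    print(area, week, ordered, positive_rate)
```

A rising positive share in one area from week to week is the sort of pattern the study would look for; whether such a pattern is meaningful enough to predict an outbreak is exactly what H0 and H1 are meant to test.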

The research questions

Could this study of STI outbreak rates be generalized to other disease outbreak rates?

Is this application of data mining in Hadoop with MapReduce the correct way to analyze the data?

The professional significance statement (new contribution to the body of knowledge)

According to Mohammed et al. (2014), identifying where an outbreak of any disease (including STIs) will occur via clinical test utilization patterns has yet to be done. They also stated that Hadoop with MapReduce is a great tool for clinical work because it has already been adopted in similar fields of medicine, like bioinformatics.


  • Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: Current landscape and future trends. Biodata Mining, 7.
  • Pokorny, J. (2011). NoSQL databases: A step to database scalability in web environment. In iiWAS ’11 Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (pp. 278-283).

Quant: Understanding Variance

If a researcher were to look at a measure of job performance resulting from two different manufacturing processes and found that the mean performance of process A was 82.5 and the mean performance of process B was 78.5, they could not automatically assume that process A will consistently outperform process B.  The researcher cannot come to a conclusion until an analysis of variance is done on the data.  Performance could vary among the observations within each process, for example because the statements of work differ or because different people carry them out (within-group variance), and performance could vary between the mean of process A and the mean of process B (between-group variance).  These two types of variance feed into the F-statistic, which allows the researcher to state whether or not they can reject the null hypothesis that the mean performances of the two processes are the same.
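The point can be illustrated with a small sketch. The code below computes a one-way ANOVA F-statistic by hand for two made-up data sets that share the same means as above (82.5 for process A and 78.5 for process B) but differ in how much the individual observations vary. The observations and the function name `one_way_anova_f` are assumptions for illustration, not data from any real study.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])

    # Between-group: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group: how far each observation sits from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical observations; both pairs have means of 82.5 and 78.5.
steady_a, steady_b = [82, 83, 82, 83], [78, 79, 78, 79]
noisy_a, noisy_b = [70, 95, 75, 90], [65, 92, 70, 87]

f_steady, df_b, df_w = one_way_anova_f([steady_a, steady_b])
f_noisy, _, _ = one_way_anova_f([noisy_a, noisy_b])
print(f_steady, f_noisy, df_b, df_w)
```

With these invented numbers the low-variance pair gives an F of 96, while the high-variance pair gives an F of about 0.21, even though the difference in means is identical. With 1 and 6 degrees of freedom the critical F at the .05 level is roughly 5.99, so only the first case would lead to rejecting the null hypothesis.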

Quant: Variances

Variance is a measure of average dispersion (Field, 2013; Schumacker, 2014).  It is a numerical value that describes how the observed data values are spread across the data distribution and how far they differ from the mean on average (Huck, 2011; Field, 2013; Schumacker, 2014).  A smaller variance indicates that the observed data values are close to the mean, and a larger variance indicates that they are spread farther from it (Field, 2013).

What happens when researchers want to study whether the difference between the means of two groups of data is statistically significant? Researchers could use ANOVA, an analysis of variance that tests whether or not to reject the null hypothesis that the mean of one group is equal to the mean of another group (Huck, 2011; Schumacker, 2014).  ANOVAs usually test categorical independent variables (groups) and continuous dependent variables (Creswell, 2014).  The results of a one-way analysis of variance are typically presented in a table that lists the variance between groups and within groups (Huck, 2011).  Schumacker (2014) explained that the variance between groups indicates the variation of the group means around the overall grand mean, while the variance within groups indicates the variation of the data points around the mean of their own group.  The variance between groups has degrees of freedom equal to the number of groups analyzed minus 1, whereas the variance within groups has degrees of freedom equal to the total number of data points minus the number of groups, which is the same as adding up (the number of data points within each group minus 1) across all groups (Huck, 2011).  Information from within and between the groups is used to calculate the F-statistic, which establishes statistical significance and allows the researcher to reject or fail to reject their null hypothesis (Field, 2013; Huck, 2011; Schumacker, 2014).
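For reference, the quantities in a one-way ANOVA table can be written out as follows. The notation is an assumption chosen here, not taken from the cited texts: $k$ is the number of groups, $n_j$ the number of data points in group $j$, $N = \sum_j n_j$ the total number of data points, $x_{ij}$ the $i$-th data point in group $j$, $\bar{x}_j$ the mean of group $j$, and $\bar{x}$ the grand mean.

```latex
\begin{align*}
SS_{\text{between}} &= \sum_{j=1}^{k} n_j \,(\bar{x}_j - \bar{x})^2,
  & df_{\text{between}} &= k - 1 \\
SS_{\text{within}} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2,
  & df_{\text{within}} &= \sum_{j=1}^{k} (n_j - 1) = N - k \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
\end{align*}
```

A large F means the group means differ by more than the spread inside the groups would suggest, which is the evidence needed to reject the null hypothesis.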


  • Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California: SAGE Publications, Inc. VitalBook file.
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Huck, S. W. (2011). Reading Statistics and Research (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Schumacker, R. E. (2014). Learning statistics using R. California: SAGE Publications, Inc. VitalBook file.