Adv DBs: A possible future project?

Below is a possible future research paper on a database related subject.

Title: Using MapReduce to aid in clinical test utilization patterns in the medicine

The motivation:

Efficient processing and analysis of clinical data could aid in better clinical tests on patients, and MapReduce solutions allow for an integrated solution in the medical field, which aids in saving resources when it comes to moving data in and out of storage.

The problem statement (symptom and root cause)

The rates of Sexually Transmitted Infections (STIs) are increasing at alarming rates, could the addition of Roper Saint Francis Clinical Network in the South test utilization patterns into Hadoop with MapReduce reveal patterns in the current STIs population and predict areas where an outbreak may be imminent?

The hypothesis statement (propose a solution and address the root cause)

H0: Data mining in Hadoop with MapReduce will not be able to identify any meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

H1: Data mining in Hadoop with MapReduce can identify a meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

The research questions

Could this study apply to STIs outbreaks rates be generalized into other disease outbreak rates?

Is this application of data-mining in Hadoop with MapReduce the correct way to analyze the data?

The professional significance statement (new contribution to the body of knowledge)

Identifying where an outbreak of any disease (or STIs), via clinical tests utilization patterns has yet to be done according to Mohammed et al (2014), and they have stated that Hadoop with MapReduce is a great tool for clinical work because it has been adopted in similar fields of medicine like bioinformatics.

Resources

  • Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: Current landscape and future trends. Biodata Mining, 7. doi:http://dx.doi.org/10.1186/1756-0381-7-22 – Doctoral Library Advanced Technologies & Aerospace CollectionPokorny, J. (2011).
  • NoSQL databases: A step to database scalability in web environment. In iiWAS ’11 Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (pp. 278-283). – Doctoral Library ACM Digital Library

Adv DBs: Data warehouses and OLAP

Data warehouses allow for people with decision power to locate the adequate data quickly from one location that spans across multiple functional departments and is very well integrated to produce reports and in-depth analysis to make effective decisions (MUSE, 2015a). Data could be stored in n-dimensional data cubes that can be dissected, filtered through, rolled up into a dynamic application called Online analytical processing (OLAP). OLAP can be its own system or part of a data warehouse, and if it’s combined with data mining tools it creates a decision support system (DSS) to uncover/discover hidden relationships within the data (MUSE, 2015b). DSS needs both a place to store data and a way to store data.  The data warehouse doesn’t solve they “Why?” questions, but the “How?, What?, When?, Where?” and that is where OLAP helps.  We want to extract as much knowledge as possible for decision making from these systems, hence this explains why we need both in DSS to solve all questions not just a subset.  But, as aforementioned that data mining tools are also needed for a DSS.

Data Warehouses

Discovering knowledge through archives of data from multiple sources in a consolidated and integrated way is what this warehouse does best.  They are subject-oriented (organized by customers, products, sales, and not in the invoice, product sales), integrated (data from different sources in the enterprise perhaps in different formats), time-variant (varies with respect to time), and nonvolatile (new data is appended not replacing old).  Suitable applications to feed data can be mainframes, proprietary file systems, servers, internal workstations, external website data, etc., which can be used for analysis and discovering knowledge for effective data-based decision making.  Detailed data can be stored online if it can help support/supplement summarized data, so data warehouses can be technically light due to summarized data.  Summarized data, which is updated automatically as new data enters the warehouse, mainly help improve query speeds. So, where is the detailed data: offline (Connolly & Begg, 2015).  Looking into the architecture of this system:

The ODS, Operational data store, holds the data for staging into the data warehouse.  From staging the load manager performs all Extraction and Loading functions to the data into the warehouse, meanwhile, the warehouse manager performs all the Transformation functions to the data into the warehouse.  The query manager performs all the queries into the data warehouse. Metadata (definition of the data stored and its units) are used by all these processes and exist in the warehouse as well (Connolly & Begg, 2015).

 OLAP

Using analytics to answer the “Why?” questions from data that is placed in an n-dimensional aggregate view of the data that is a dynamical system, sets this apart from other query systems.  OLAP is more complex than statistical analysis on aggregated data, it’s more of a slice and dice with time series and data modeling.  OLAP servers come in four main flavors: Multidimensional OLAP (MOLAP: uses multidimensional databases where data is stored per usage), Relational OLAP (ROLAP: supports relational DBMS products with a metadata layer), Hybrid OLAP (HOLAP: certain data is ROLAP and other is in MOLAP), and Desktop OLAP (DOLAP: usually for small file extracts, data is stored in client files/systems like a laptop/desktop).

DSS, OLAP, and Data Warehouse Application

Car insurance claims DSS.  Insurance companies can use this system to analyze patterns of driving from people, what damage can or cannot occur due to an accident, and why someone might claim false damages to fix their cars or cash out.  Thus, their systems can define who, what, when, where, why and how per accident against all other accidents (they can slice and dice by state, type of accident, vehicle types involved, etc) they have processed to help them resolve if a claim is legitimate.

References:

Data Tools: WEKA

WEKA

The Java based, open sourced, and platform independent Waikato Environment for Knowledge Analysis (WEKA) tool, for data preprocessing, predictive data analytics, and facilitation interpretations and evaluation (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Miranda, n.d.; Xia & Gong, 2014).  It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga 2015).  It is a java based machine learning algorithm for data mining tasks as well as text mining that could be used for predictive modeling, housing pre-processing, classification, regression, clustering, association rules, and visualization (WEKA, n.d). Also, WEKA contains classification, clustering, association rules, regression, and visualization capabilities, in particular, the C4.5 decision tree predictive data analytics algorithm (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011). Here WEKA is an open source data and text mining software tool, thus it is free to use. Therefore there are no costs associated with this software solution.

WEKA can be applied to big data (WEKA, n.d.) and SQL Databases (Patel & Donga, 2015). Subsequently, WEKA has been used in many research studies that are involved in big data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011; Parkavi & Sasikumar, 2016; Xia & Gong, 2014). For instance, Barak and Modarres (2015) used WEKA for decision tree analysis on predicting stock risks and returns.

The fact that it has been using in this many research studies is that the reliability and validity of the software are high and well established.  Even in a study comparing WEKA with 12 other data analytics tools, is one of two apps studied that have a classification, regression, and clustering algorithms (Gera & Goel, 2015).

A disadvantage of using this tool is its lack of supporting multi-relational data mining, but if one can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). The comprehensiveness of analysis algorithms for both data and text mining and pre-processing is its advantage. Another disadvantage of WEKA is that it cannot handle raw data directly, meaning the data had to be preprocessed before it is entered into the software package and analyzed (Hoonlor, 2011). WEKA cannot even import excel files, data in Excel have to be converted into CSV format to be usable within the system (Miranda, n.d.)

References:

  • Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124. doi:http://dx.doi.org/10.1007/s10799-012-0135-7
  • Gera, M., & Goel, S. (2015). Data Mining -Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Kumar, D., & Fet, D. (2011). Performance Analysis of Various Data Mining Algorithms: A Review. International Journal of Computer Applications, 32(6), 9–16.
  • Miranda, S. (n.d.). An Introduction to Social Analytics : Concepts and Methods.
  • Parkavi, S. & Sasikumar, S. (2016). Prediction of Commodities Market by Using Data Mining Technique. i-Manager’s Journal on Computer Science.
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • WEKA (n.d.) WEKA 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Xia, B. S., & Gong, P. (2014). Review of business intelligence through data analysis. Benchmarking, 21(2), 300–311. http://doi.org/http://dx.doi.org/10.1108/BIJ-08-2012-0051