Adv DBs: Data Abstractions

Data Abstraction

Text can be abstracted for information and knowledge through either hard clustering, where a word has only one connection, or soft clustering, where a word can have multiple connections to other words (Kulkarni & Kinariwala, 2013).  Clustering, in general, is grouping things with similar characteristics.  Hard clustering is difficult to apply to the sentences of a paragraph, or to prose in general, because each sentence is interconnected with the sentences above and below it.  Clusters within prose can also overlap with each other.  Thus, the authors propose that soft clustering should be used for the analysis of sentences within prose.  The method proposed in Kulkarni & Kinariwala is PageRank, used to show the importance of a sentence (or sentences) within a document and thus to help summarize it.  The weakness of this paper is that the authors propose an idea without testing it.  They did not develop any code or analyze any data set to show whether their hypothesis about PageRank was correct.  Thus, it remains a wonderful thought experiment.  Its strength, if other studies prove it correct, is that it maps out the limitations and strengths of hard and soft clustering for data mining within a piece of prose and between pieces of prose of a similar nature.
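To make the sentence-ranking idea concrete, here is a minimal sketch of a weighted PageRank over a sentence similarity graph. It is not the authors' implementation, since the paper does not provide one. The Jaccard word-overlap similarity, the damping factor of 0.85, the function names, and the sample sentences are all assumptions chosen for illustration.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Jaccard word overlap between two sentences (an assumed measure)."""
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank over their similarity graph."""
    n = len(sentences)
    # weights[i][j] is the edge weight from sentence i to sentence j (no self-links).
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = 0.0
            for i in range(n):
                if out_sums[i] > 0:
                    incoming += scores[i] * weights[i][j] / out_sums[i]
                else:
                    # A sentence with no links spreads its score evenly.
                    incoming += scores[i] / n
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


# Illustrative sentences only; the last one shares no words with the others.
sentences = [
    "Soft clustering lets a sentence belong to several clusters.",
    "Hard clustering assigns each sentence to one cluster.",
    "Clustering groups sentences with similar characteristics.",
    "Bananas are yellow.",
]
scores = rank_sentences(sentences)
```

Because every sentence can be linked to several others with different weights, the graph reflects the overlapping, soft relationships the paper argues for. Sentences that share vocabulary with many others accumulate higher scores, and those are the candidates for a summary.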

Management issues in systems development

Information is of great value to humanitarian efforts in accomplishing their missions.  Van de Walle & Comes (2015) state that the United Nations has delivered methods for humanitarian Information Management that revolve around the checking, sharing, and use of data.  Checking data revolves around reliability and verifiability; sharing data revolves around interoperability (data formats), accessibility, and sustainability; and the use of data deals with timeliness and relevance.  After interviewing humanitarians from two different disaster scenarios, Syria and Typhoon Haiyan, for about 1-1.5 hours, they concluded that standard processes can be followed for natural disasters like a landfalling hurricane.  However, standard processes lend themselves to inflexibility and do not meet all the intricate needs.  In a more complicated relief effort like Syria, confidentiality and unreliable data sources (sometimes arriving like something out of an old spy movie, under the table, etc.) affected the entire process.  Finally, the small sample of two events and the humanitarians interviewed suggests that further research is needed before generalizations can be made about developing Information Management systems for natural versus geopolitical disasters.

The main strength of this paper is its breakdown of disaster information management with respect to the standards imposed by the UN.  It also illustrates that information management is end-to-end.  My research hopes to help improve pre-disaster conditions, whereas their research covers post-disaster aid.  For the same disaster, a landfalling hurricane, the key information needed changes with the task at hand.  In other words, hurricane wind speeds are no longer needed after the storm has passed over a city and left a wake of destruction, and the death toll is not relevant before the hurricane makes landfall.  But we need wind speeds to improve forecasts and mitigate death tolls, and we need the current death toll to make sure we can keep it from rising after the disaster has struck.


  • Van de Walle, B., & Comes, T. (2015). On the Nature of Information Management in Complex and Natural Disasters. Procedia Engineering, pp. 403-411.
  • Kulkarni, B. M., & Kinariwala, S. A. (2013). Review on Fuzzy Approach to Sentence Level Text Clustering. International Journal of Scientific Research and Education, pp. 3845-3850.

Adv DBs: A possible future project?

Below is an outline for a possible future research paper on a database-related subject.

Title: Using MapReduce to aid the analysis of clinical test utilization patterns in medicine

The motivation:

Efficient processing and analysis of clinical data could lead to better clinical testing of patients.  MapReduce allows for an integrated solution in the medical field, which saves resources that would otherwise be spent moving data in and out of storage.

The problem statement (symptom and root cause)

The rates of Sexually Transmitted Infections (STIs) are increasing at an alarming pace.  Could loading the test utilization patterns of the Roper Saint Francis Clinical Network in the South into Hadoop with MapReduce reveal patterns in the current STI population and predict areas where an outbreak may be imminent?

The hypothesis statement (propose a solution and address the root cause)

H0: Data mining in Hadoop with MapReduce will not be able to identify any meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

H1: Data mining in Hadoop with MapReduce can identify a meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.
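As a rough illustration of the kind of job these hypotheses refer to, the sketch below simulates the map, shuffle, and reduce phases in plain Python rather than on a Hadoop cluster. The record layout (region, test name, result), the sample values, and all function names are hypothetical and do not come from any real clinical data set or Hadoop API.

```python
from collections import defaultdict

# Hypothetical test records: (region, test_name, result). Not real data.
records = [
    ("county_a", "chlamydia", "positive"),
    ("county_a", "chlamydia", "negative"),
    ("county_b", "gonorrhea", "positive"),
    ("county_a", "gonorrhea", "positive"),
    ("county_b", "chlamydia", "negative"),
]


def map_record(record):
    """Map phase: emit ((region, test), (ordered, positive)) for one record."""
    region, test_name, result = record
    yield (region, test_name), (1, 1 if result == "positive" else 0)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: total the tests ordered and the positives for one key."""
    ordered = sum(value[0] for value in values)
    positive = sum(value[1] for value in values)
    return key, {"ordered": ordered, "positive": positive}


def run_job(input_records):
    """Run map, shuffle, and reduce in memory over the input records."""
    mapped = (pair for record in input_records for pair in map_record(record))
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())


summary = run_job(records)
```

The per-region counts in `summary` are only a starting point. Testing H1 would still require comparing such counts over time and across locations to see whether any pattern precedes an outbreak.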

The research questions

Could the findings of this study on STI outbreak rates be generalized to other disease outbreak rates?

Is this application of data-mining in Hadoop with MapReduce the correct way to analyze the data?

The professional significance statement (new contribution to the body of knowledge)

Identifying where an outbreak of any disease (including STIs) may occur via clinical test utilization patterns has yet to be done, according to Mohammed et al. (2014).  They also state that Hadoop with MapReduce is a great tool for clinical work because it has already been adopted in related fields such as bioinformatics.


  • Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: Current landscape and future trends. BioData Mining, 7.
  • Pokorny, J. (2011). NoSQL databases: A step to database scalability in web environment. In iiWAS ’11 Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (pp. 278-283).