Parallel Processing: Ada Tasking and State Modeling

Sample Code

1  protected Queue is
2     procedure Enqueue (elt : in Q_Element);
3                                  -- Insert elt at the tail end of the queue
4     function First_in_line return Q_Element;
5                                  -- Return the element at the head of the
6                                  -- queue, or no_elt
7     procedure Dequeue;
8                                  -- If the queue is not empty, remove the
9                                  -- element at its head
10 end Queue;
11
12 task type Worker;
13
14 task body Worker is
15    elt : Q_Element;              -- Element retrieved from queue
16 begin
17    while true loop
18       elt := Queue.First_in_line;  -- Get element at head of queue
19       if elt = no_elt then         -- Let the task loop until there
20          delay 0.1;                -- is something in the queue
21       else
22          Process (elt);            -- Process elt. This takes some time
23          Queue.Dequeue;            -- Remove element from queue
24       end if;
25    end loop;
26 end Worker;

Comparing and Contrasting Ada’s Protected Functions to Java’s (non)synchronized Methods

Java’s safe objects are synchronized objects, whose methods are either “synchronized” or “non-synchronized”.  In non-synchronized methods, the shared data is supposed to be treated as read-only.  In synchronized methods, shared data can be written as well as read, but these methods usually begin with a wait loop on some condition (i.e. while (condition) {wait( );}).  This wait loop stalls the thread until the condition allows it to proceed, preventing multiple threads from modifying the same safe object at the same time.  Wait loops located elsewhere in a Java synchronized method are uncommon.

Safe objects in Ada are protected objects, whose operations are protected procedures, protected functions, or protected entries.  Unlike the non-synchronized method in Java (which merely should be read-only), the protected function in Ada is enforced as read-only.  Where Java’s synchronized methods stall a thread mid-body with wait, an Ada protected entry (i.e. entry X when Condition is …) is only entered once its barrier condition is true.  You can therefore have multiple entries whose barriers cover different cases, much like an if-else chain: for example, entry X when Condition is … alongside another entry whose barrier is when not Condition.  This can be expanded to n different conditions.  With these entries, the barrier is always tested first, in contrast to wait, which is tested inside the body.  We can also requeue on another entry (requeue is not a subprogram call, so the thread does not return to the point after the requeue statement), but it is uncommon to have requeues located elsewhere in the program.
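To make the barrier idea concrete, below is a minimal, compilable Ada sketch; the names (Gate, Wait_Open, Wait_Closed, Toggle) are illustrative assumptions, not part of the sample code.  Each entry’s barrier is tested before its body runs, and barriers are re-evaluated whenever a protected procedure changes the state:

with Ada.Text_IO;

procedure Barrier_Demo is

   protected Gate is
      entry Wait_Open;         -- blocks until Open is True
      entry Wait_Closed;       -- blocks until Open is False
      procedure Toggle;        -- flips the condition
   private
      Open : Boolean := False;
   end Gate;

   protected body Gate is
      entry Wait_Open when Open is
      begin
         null;                 -- the barrier already guarantees Open = True
      end Wait_Open;

      entry Wait_Closed when not Open is
      begin
         null;
      end Wait_Closed;

      procedure Toggle is
      begin
         Open := not Open;     -- barriers are re-evaluated after this returns
      end Toggle;
   end Gate;

begin
   Gate.Wait_Closed;           -- returns at once: "not Open" is True
   Gate.Toggle;
   Gate.Wait_Open;             -- returns at once: "Open" is now True
   Ada.Text_IO.Put_Line ("Passed both barriers.");
end Barrier_Demo;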

Describe what Worker does

A Worker polls the queue for the element (elt) at the head of the line and loops until there is an element it can process.  The element is stored in the local variable elt.  If there is no element, the task delays 0.1 seconds and keeps looping.  Once the Worker has obtained an element, it processes the element and then removes it from the queue.

Adapt Queue and Worker so that multiple instances of Worker can process data elements

In order for exactly one Worker to process each element, retrieving an element and removing it from the queue must happen as a single protected action.  Worker itself must remain a task on line 14, since it is the concurrent activity; the change belongs in Queue.  Instead of the separate calls on lines 18 and 23, Queue can expose a combined operation, such as procedure Get (elt : out Q_Element), which returns the head element and removes it in one call; the Worker then declares elt on line 15 and fills it through Get.

Once there is an element at the head of the queue, a Worker must dequeue it before processing it; this protects the data and lets another Worker take the next element at the head.  Thus, I propose swapping lines 22 and 23.  If that is not preferred, we can create a new structure called work_in_progress, with get, put, and remove procedures, and add the element to it just before line 22, then follow my proposed sequence.  This lets a Worker say: I claimed this element and will work on it; if processing succeeds, the element is removed from work_in_progress and never re-enters the queue, so other Workers are never held up.  If a Worker fails to process the element, it is returned to the queue for another Worker to try.  To avoid an endless loop, an element that three different Workers fail to process can be moved to a separate structure of non-compliant elements and dequeued from the main queue.  If processing can never fail, simply swapping lines 22 and 23 is enough on its own.

In Queue, line 2 should become an entry, for instance “entry Enqueue (elt : in Q_Element) when Count < Capacity is … end Enqueue”, so callers wait until there is actually room to add an element.  Similarly, line 7 can become “entry Dequeue (elt : out Q_Element) when Count > 0 is … end Dequeue”, so callers wait until there is actually an element to remove.  Using entries in Queue eliminates the need for the while-true loop that polls for an element at the head of the queue in lines 19, 20, and 21; the conditions are checked up front in the barriers rather than later in the code.  It also improves efficiency and lets a Worker say: I have this element and it is safe to delete it from the queue.  With all these changes, line 18 must obtain its element through the Dequeue entry itself.

The loop where instances of Worker wait is crude

Lines 18 and 23 can be combined into a single call, Queue.Pop (elt), placed where line 18 is, removing the if-statement and avoiding the crude loop in which Worker tasks poll for something in the queue.  Pop allows for no “busy waiting”: we add to Queue an entry Pop (elt : out Q_Element) with a when Count > 0 barrier, which returns the element at the head of the queue and removes it in one protected action.
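Putting these proposals together, below is one possible reading of the adapted design as a minimal, compilable Ada sketch.  The names Pop, Capacity, and the Done sentinel, and the bounded circular buffer standing in for the unspecified queue internals, are assumptions for illustration.  Pop’s barrier replaces the polling loop, and each element is retrieved and removed in one protected action, so exactly one Worker processes it:

with Ada.Text_IO;

procedure Queue_Demo is

   type Q_Element is new Integer;
   Done : constant Q_Element := -1;        -- sentinel telling a Worker to stop

   Capacity : constant := 10;
   type Index is mod Capacity;
   type Buffer_Array is array (Index) of Q_Element;

   protected Queue is
      entry Enqueue (elt : in Q_Element);
      --  Waits while the queue is full, then inserts elt at the tail
      entry Pop (elt : out Q_Element);
      --  Waits while the queue is empty, then returns and removes the head
   private
      Buffer     : Buffer_Array;
      Head, Tail : Index := 0;
      Count      : Natural := 0;
   end Queue;

   protected body Queue is
      entry Enqueue (elt : in Q_Element) when Count < Capacity is
      begin
         Buffer (Tail) := elt;
         Tail  := Tail + 1;                -- modular type wraps around
         Count := Count + 1;
      end Enqueue;

      entry Pop (elt : out Q_Element) when Count > 0 is
      begin
         elt   := Buffer (Head);
         Head  := Head + 1;
         Count := Count - 1;
      end Pop;
   end Queue;

   task type Worker;
   task body Worker is
      elt : Q_Element;
   begin
      loop
         Queue.Pop (elt);                  -- no busy waiting, no if-statement
         exit when elt = Done;
         Ada.Text_IO.Put_Line ("Processing" & Q_Element'Image (elt));
      end loop;
   end Worker;

   Workers : array (1 .. 3) of Worker;     -- multiple Worker instances

begin
   for I in 1 .. 20 loop
      Queue.Enqueue (Q_Element (I));
   end loop;
   for W in Workers'Range loop
      Queue.Enqueue (Done);                -- one stop signal per Worker
   end loop;
end Queue_Demo;

If processing can fail, the work_in_progress structure described earlier would slot in between Pop and the processing step; the sketch above assumes processing never fails, matching the simple swap of lines 22 and 23.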

 

The context with an image

Sometimes, as data scientists or scientists in general, we produce beautiful charts that are chock-full of meaning and data; to those in the outside world, however, they can be misconstrued.  To avoid your graphs being misread on a dashboard, in a paper, or even in a blog post, context is sometimes needed.  The amount of context needed depends on the complexity of the material and the severity of misinterpretation.  The higher the complexity, the more contextual text is needed to help the reader digest the information you are presenting.  The higher the severity of misinterpretation (i.e. life-threatening if misread, or a loss of millions of dollars), the more contextual text you should include.

Contextual text can help a reader understand your tables, graphs, or dashboards, but not every instance requires the same level of detail.  The following are mere examples of what light, medium, and heavy context could include:

Light Context (bullet points)

  • Source system or source details
  • Details on allocations impacting objects in a model
  • Details on data joins or algorithms used
  • Data nuances (excludes region x)

Medium Context (Calling out use cases)

  • A succinct explanation of what the user should be able to get out of the report/reporting area, graph, table, or dashboard

Heavy Context (Paragraph Explanations)

  • The best example is the results section of a scientific peer-reviewed journal, which not only has a figure description, but they go into detail about areas to pay attention to, outliers, etc.

Below is an example from the National Hurricane Center’s (NHC) 5-day forecast cone for Tropical Storm Sebastian.  Notice the note:

“Note: The cone contains the probable path of the storm center but does not show the size of the storm.  Hazardous conditions can occur outside of the cone.” (NHC, 2019).

This line alone falls under light context until you add the key below, which is a succinct explanation of how to read the graphic, making the whole graphic fall under medium context.

[Image: NHC 5-day forecast cone for Tropical Storm Sebastian, with key]

A second image, originally produced by the NHC for Hurricane Beryl in 2016, shows an example of heavy context: a block of text added below the NHC image by an app.  The application from which this image is pulled states the following (Appadvice.com, n.d.):

“This graphic shows an approximate representation of coastal areas under a hurricane warning (red), hurricane watch (pink), tropical storm warning (blue) and tropical storm watch (yellow). The orange circle indicates the current position of the center of the tropical cyclone. The black line, when selected, and dots show the National Hurricane Center (NHC) forecast track of the center at the times indicated. The dot indicating the forecast center location will be black if the cyclone forecast to be tropical and will be white with a black outline if the cycle is …”

[Image: NHC watch/warning graphic for Hurricane Beryl (2016), with the app’s explanatory text below]


Strategic Key Elements Case

To understand a corporate strategy, you need to understand its key elements.  Most corporate strategies are showcased in the investor annual report, usually in the first ten or so pages.  The key elements of a corporate strategy include: Mission, Scope, Competitive Advantage, Action Program, and Decision-Making Guidance.  Below is a case study from Boeing’s 2017 Annual Report, prior to the 737 MAX situation.

Strategy key elements:

  • Mission: “Our purpose and mission is to connect, protect, explore and inspire the world through aerospace innovation. We aspire to be the best in aerospace and an enduring global industrial Champion.” (Page 1)
  • Scope: “…manufacturer of commercial airplanes and defense, space and security systems and a major provider of government and commercial aerospace services…. Our products and tailored services include commercial and military aircraft, satellites, weapons, electronic  and defense system, launch systems, digital aviation services, engineering modifications and maintenance, supply chain services and training” (Table of Contents)
  • Competitive Advantage:
    • Supply Chain integration: “leverages the talents of skilled people working for Boeing suppliers worldwide, including 1.3 million people at 13,600 U.S. companies” (Table of Contents)
    • Intellectual property: “We own numerous patents and have licenses for the use of patents owned by others, which relate to our products and their manufacture. In addition to owning a large portfolio of intellectual property, we also license intellectual property to and from third parties.” (Page 2)
  • Action Program:  “The importance of our purpose and mission demands that we work with the utmost integrity and excellence and embrace the enduring values that define who we are today and the company we aspire to be tomorrow. These core values—integrity, quality, safety, diversity and inclusion, trust and respect, corporate citizenship and stakeholder success—remind us all that how we do our work is every bit as important as the work itself. Living these values also means being best-in-class in community and environmental stewardship.” (Page 8)
  • Decision-Making Guidance:
    • “Our ongoing pursuit of improved affordability and productivity, as well as strong performance across our core business areas…have us well positioned to fund our future.” (Page 7)
    • “Succeeding in rapidly changing global markets requires that we think and do things differently. It demands change, a willingness to embrace it and the agility to both drive and respond to external forces.” (Page 7)
    • “… we continue to make meaningful progress toward building a zero-injury workplace, achieving world-class quality and creating the kind of culture that breaks down organizational barriers, eliminates bureaucracy and unleashes the full capabilities of our people.” (Page 8)

 

Resources:

  • Boeing. (2017). 2017 Annual Report.

Parallel Programming: Vector Clocks

Groups of nodes act together: a message can be sent (multicast) to a group, and it is received by all the nodes in that group (Sandén, 2011).  In a broadcast, all nodes in the system get the same message.  In a multicast, messages can reach the nodes of a group in different orders: First-In-First-Out order, causal order, or total order (an atomic multicast, if it is reliable).

Per Sandén (2011), a causal-order multicast requires the source to be a member of the group, and causal order cannot span groups.  A two-phase, total-order multicast system can look like it uses vector clocks, but it does not: each message send or receive increments the counter by one as the systems talk to each other.

Below is an example of a vector clock:

To G1

  • m1 suggested time at 6 and m2 suggested time at 11
  • m1 commit time at 7 and m2 commit time at 12

To G2

  • m1 suggested time at 7 and m2 suggested time at 12
  • m1 commit time at 8 and m2 commit time at 13

[Image: vector clock diagram for the two-phase multicast example above]
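Whatever the missing diagram showed, the standard vector clock bookkeeping itself is easy to state, and the following minimal Ada sketch captures it; the names (Tick, Merge, Node_Count) are illustrative assumptions.  A node increments its own slot on a send (or any local event); on a receive it takes the component-wise maximum of its clock and the message’s stamp, then ticks its own slot:

with Ada.Text_IO;

procedure Vector_Clock_Demo is

   Node_Count : constant := 3;
   type Node_Id is range 1 .. Node_Count;
   type Clock is array (Node_Id) of Natural;

   --  On a send (or any local event), a node increments its own slot.
   procedure Tick (C : in out Clock; Self : Node_Id) is
   begin
      C (Self) := C (Self) + 1;
   end Tick;

   --  On a receive, merge: component-wise max of the local clock and the
   --  stamp carried by the message, then tick the local slot.
   procedure Merge (Local : in out Clock; Stamp : Clock; Self : Node_Id) is
   begin
      for N in Node_Id loop
         Local (N) := Natural'Max (Local (N), Stamp (N));
      end loop;
      Tick (Local, Self);
   end Merge;

   A, B : Clock := (others => 0);

begin
   Tick (A, 1);        -- node 1 sends:    A = (1, 0, 0)
   Merge (B, A, 2);    -- node 2 receives: B = (1, 1, 0)
   Ada.Text_IO.Put_Line ("B =" & Natural'Image (B (1)) &
                         Natural'Image (B (2)) & Natural'Image (B (3)));
end Vector_Clock_Demo;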

Reference

  • Sandén, B. I. (2011). Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley.

Parallel Programming: State diagram of Maekawa’s voting algorithm

Sandén (2011) defines state diagrams as a way to show the possible states an object can be in.  He also notes that events are action verbs that label the arrows between two states (if an action does not change the state, it can be listed inside the state), and that actions can have conditions on them.

Thus, a state diagram shows the transitions from state to state as events occur.  An event usually has many occurrences, and each occurrence is instantaneous.  A super-state can encompass multiple states (Sandén, 2011).  An activity, by contrast, is an operation that takes time and is marked with the keyword “do /”.

The goal of this post was to make a state diagram of Maekawa’s voting algorithm (the “Maekawa’s algorithm” problem within the “Distributed mutual exclusion” set).  This can be done in various ways. One option is the following five states:

  • Released and not voted
  • Released and voted
  • Wanted and not voted
  • Wanted and voted
  • Held and voted (for self)

Events are:

  • request_received, etc., for messages arriving from other nodes
  • acquire for when the local node wants the lock
  • release for when the local node gives up the lock.

A possible solution is shown below:

[Image: state diagram of Maekawa’s voting algorithm]
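As a rough cross-check of the diagram, here is a minimal Ada sketch of one possible transition function over those five states.  It is an assumption-laden reading, not Sandén’s definitive solution: the event All_Votes_Received goes beyond the three events listed above, and transitions that would involve queued requests are simplified to leave the state unchanged:

with Ada.Text_IO;

procedure Maekawa_Sketch is

   type State is
     (Released_Not_Voted, Released_Voted,
      Wanted_Not_Voted,   Wanted_Voted,
      Held_Voted);

   type Event is
     (Request_Received,     -- a node asks for our vote
      Release_Received,     -- the node we voted for gave up the lock
      Acquire,              -- the local node wants the lock
      All_Votes_Received,   -- assumed event: the whole quorum voted for us
      Release);             -- the local node gives up the lock

   function Next (S : State; E : Event) return State is
   begin
      case E is
         when Request_Received =>
            --  Vote for the requester if we have not voted; otherwise the
            --  request is queued and the state is unchanged.
            case S is
               when Released_Not_Voted => return Released_Voted;
               when Wanted_Not_Voted   => return Wanted_Voted;
               when others             => return S;
            end case;
         when Release_Received =>
            --  Our vote is freed (simplified: with queued requests we would
            --  immediately vote again and stay "voted").
            case S is
               when Released_Voted => return Released_Not_Voted;
               when Wanted_Voted   => return Wanted_Not_Voted;
               when others         => return S;
            end case;
         when Acquire =>
            case S is
               when Released_Not_Voted => return Wanted_Not_Voted;
               when Released_Voted     => return Wanted_Voted;
               when others             => return S;
            end case;
         when All_Votes_Received =>
            --  The quorum includes the node itself, hence "voted (for self)".
            case S is
               when Wanted_Voted => return Held_Voted;
               when others       => return S;
            end case;
         when Release =>
            case S is
               when Held_Voted => return Released_Not_Voted;
               when others     => return S;
            end case;
      end case;
   end Next;

   S : State := Released_Not_Voted;

begin
   S := Next (S, Acquire);             -- Wanted_Not_Voted
   S := Next (S, Request_Received);    -- votes (for itself): Wanted_Voted
   S := Next (S, All_Votes_Received);  -- Held_Voted
   Ada.Text_IO.Put_Line (State'Image (S));
end Maekawa_Sketch;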

Reference

  • Sandén, B. I. (2011). Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley.

Qualitative Analysis: Coding Project Report of a Virtual Interview Question

The virtual interview question: Explain what being a doctoral student means for you? How has your life changed since starting your doctoral journey?

Description of your coding process

The steps I followed in this coding process were to read the responses once, at least one week before this individual project assignment was due.  This allowed me to think about generic themes and codes at a very high level throughout the week.  Then, after the week was over, I went to wordle.net to create a word cloud of the top 50 most-used words in this virtual interview, with the results below.


Figure 1: Screenshot of wordle.net results, which were used to help develop codes and sub-codes; words that appear bigger occur more often in the virtual interview than words that appear smaller.

The most telling themes from Figure 1 are: Time, Family, Life, Work, Student, Learning, First, Opportunity, Research, People, etc.  This helped create some of the codes and sub-codes, like prioritization, for family, etc.  Figure 1 also confirmed the ideas for codes that had been forming in my head over the past week, so I felt ready to begin coding.  After deciding on the initial set of codes, I did some manual coding while asking the questions: “What is the person saying? How are they saying it? And could there be a double meaning in the sentence?”  The last question helped me identify whether a sentence in this virtual interview had multiple codes within it.  I used QDA Miner Lite as my software of choice for coding; it is a free product, and there are plenty of end-user tutorials on YouTube, made by researchers from many fields, on how to use this software effectively.  After the initial manual coding, I revisited the initial codebook.  Some of the sub-codes that fell under betterment were moved into the future code, as they better fit that theme than pure betterment.  This reanalysis went on for all the codes.  As I re-read the responses for a third time, some new sub-codes were added as well.  The reason for re-reading this virtual interview a third time was to make sure that few codes, if any, were still missing.

Topical Coding Scheme (Code Book)

The codebook that was derived is as follows:

  • Family
    • For Family
    • Started by Family
    • First in the Family
  • Perseverance
    • Exhausted
    • Pushing through
    • Life Challenges
    • Drive/Motivation
    • Goals
  • Betterment
    • Upgrade skills
    • Personal Growth
    • Maturity
    • Understanding
    • Priority Reanalysis
  • Future
    • More rewarding
    • Better Life
    • Foresight
  • Proving something
    • To others

 

Diagram of findings

Below are images generated through the analytical/automated part of QDA Miner Lite:


Figure 2: Distribution of codes in percentages throughout the virtual interview.

Figure 3: Distribution of codes in frequency throughout the virtual interview.


Figure 4: Distribution of codes in frequency throughout the virtual interview in terms of a word cloud where more frequent codes appear bigger than less frequent codes.

Brief narrative summary of findings, referring to your graphic diagrams

Given Figures 2-4, one could say that the biggest theme for going into the doctoral program is the prospect of a better life and the hope of changing the world, as these showed up most frequently in the interview.  One student states that the degree would open many doors: “Pursuing and obtaining this level of degree would help to open doors that I may not be able to walk through otherwise.” Another student hopes their research will change the future lives of many: “The research that I am going to do will hopefully allow people to truly pursue after their dreams in this ever-changing age, and let the imagination of what is possible within the business world be the limit.” Other students are a bit more practical, stating things like “…move up in my organization and make contributions to the existing knowledge” and, finally, “More opportunities open for you as well as more responsibility for being credible and usefulness as a cog in the system”

Another concept that kept repeating is that this is done for family, and that because of family, work, and school, the life of a doctoral student in this class has to be reprioritized (hence the code priority reanalysis).  All forms of graphical output show that these are the two most significant drivers toward the degree.  One student went to an extreme: “Excluding family and school members, I am void of the three ‘Ps’ (NO – people, pets, or plants). I quit my full-time job and will be having the TV signal turned off after the Super Bowl to force additional focus.”  Another student said that time was the most important thing they had and that it has changed significantly: “The most tangible thing that has changed in my life since I became a doctoral student has been my schedule.  Since this term began I have put myself on a strict schedule designating specific time for studies, my wife, and time for myself.”  Finally, another student says balance is key for them: “Having to balance family time, work, school, and other social responsibilities, has been another adjusted change while on this educational journey. The support of my family has been very instrumental in helping me to succeed and the journey has been a great experience thus far.”  There are 7 instances in which these two codes overlap or are included within each other, which happens about 80% of the time.

Thus, from this virtual interview, I am able to conclude that family is mentioned alongside priority reanalysis in order to meet the goal of the doctoral degree, and that time management, a component of priority reanalysis, is key.  Some students take this reanalysis to the extreme, as mentioned above, but if they feel that is the only way they can accomplish this degree in a timely manner, who am I to judge?  After all, it is the researcher’s job to remain unbiased when coding.  However, while family can drive people to complete the degree, it is the prospect of a better life and of changing the world for the better that was mentioned most.

Appendix A

An output file from qualitative software can be generated by using QDA Miner Lite.

 

LSAT Conditionals and CS Conditionals

In the past few months, I have been studying for the LSAT exam. Yes, I am contemplating law school; law school will be a topic for another day.  However, I came across a few points that are extremely interesting and could spark discussion in the computer science field.  In computer science, our coding languages have conditional statements, the most common being the IF-THEN statement. The LSAT has made me realize that there is more to IF-THEN conditional statements than meets the eye, and here is why (Teti et al., 2013):
  1. If X then Y (a simple IF-THEN conditional)
  2. If not Y then not X (the contrapositive of 1)
  3. X if and only if Y means both “if X then Y” and “if Y then X”
  4. X unless Y means if not X then Y
where X is the sufficient variable and Y is the necessary variable. The word “if” can be substituted with “all,” “any,” “every,” or “when” (Teti et al., 2013), whereas “then” can be substituted with “only” or “only if.” Remember that a conditional phrase like the ones above introduces a relationship between the variables, but it doesn’t establish anything concrete: a sufficient variable (X) is enough to guarantee Y, but Y is not enough on its own to guarantee X.
Subsequently, with any conditional, we also have to look at conjunctive “and” and disjunctive “or” statements:
  1. Both X and Y = X + Y
  2. Either X or Y = X or Y
  3. Not both X and Y = not X or not Y
  4. Neither X nor Y = not X and not Y
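These translations can be checked mechanically. Below is a minimal Ada sketch (Ada being the language used elsewhere in these posts) that tests them exhaustively over all truth values of X and Y; the helper Implies is my own name, encoding “if X then Y” as “not X or Y”:

with Ada.Text_IO;

procedure Conditionals is

   --  Material implication: "if X then Y" fails only when X is True, Y False.
   function Implies (X, Y : Boolean) return Boolean is (not X or Y);

begin
   for X in Boolean loop
      for Y in Boolean loop
         --  (1) and (2): a conditional equals its contrapositive
         pragma Assert (Implies (X, Y) = Implies (not Y, not X));
         --  (3): "X if and only if Y" is implication in both directions
         pragma Assert ((X = Y) = (Implies (X, Y) and Implies (Y, X)));
         --  (4): "X unless Y" read as "if not Y then X" equals "if not X then Y"
         pragma Assert (Implies (not Y, X) = Implies (not X, Y));
         --  "Not both X and Y" is "not X or not Y" (De Morgan)
         pragma Assert ((not (X and Y)) = (not X or not Y));
         --  "Neither X nor Y" is "not X and not Y" (De Morgan)
         pragma Assert ((not (X or Y)) = (not X and not Y));
      end loop;
   end loop;
   Ada.Text_IO.Put_Line ("All equivalences hold.");
end Conditionals;

(With GNAT, compile with -gnata so the assertions are actually checked at run time.)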

We should note that an “or” statement can also allow for the possibility of both (Teti et al., 2013). Additionally, the LSAT adds some nuance to conditional phrases with an “EXCEPT” clause.  For instance (Teti et al., 2013):

  1. Must be true EXCEPT = Could be false
  2. Could be true EXCEPT = Must be false
  3. Could be false EXCEPT = Must be true
  4. Must be false EXCEPT = Could be true
The LSAT views these conditional, conjunctive, and disjunctive phrases with a bit more nuance than computer scientists usually do, and maybe we can borrow some of that nuance in future coding to get more nuanced code and results.
Some people may say that this whole post is overkill and ask why we should look into such nuance. Each one of the above bullets is necessary and has value; each exists in the lexicon for a particular reason. We can easily decompose each of these and map them out in simpler terms in a programming language. However, sufficiently capturing all the nuances of these conditional phrases can produce some truly convoluted code.
Resources:
  • Teti, T., Teti, J., & Riley, M. (2013). The Blueprint for LSAT Logic Games. Blueprint LSAT Preparation.

A/B Testing

Are you a HiPPO? HiPPO stands for the Highest-Paid Person’s Opinion: the person who designs the website or dictates how things ought to be (Christian, 2012). This may not be a good thing, because the HiPPO may not be the best person to draw the maximum traffic to and through your website. Proponents of A/B testing state that its advantages outweigh the time it takes to conduct it in the first place (Christian, 2012; Patel, n.d.). Patel (n.d.) further claims that “A/B tests, done consistently, can improve your bottom line substantially.”  A/B testing helps the data scientist narrow down which element/variable makes an effective difference toward a goal, i.e. the click-through rate on a website that generates revenue (Christian, 2012; Patel, n.d.; Unbounce.com, n.d.).
First, you need to know which elements/variables you want to test. It is key to know what you want to test, the current baseline result, what you are testing for, and the goal you want to reach (Patel, n.d.). If you have a click funnel for your audience and they are dropping out at a certain level, you may want to target that area to reduce the fallout rate.  Once you know what to test, make a list of all the variables you would like to test (Christian, 2012; Patel, n.d.; Unbounce.com, n.d.):
  • location
  • color
  • button type
  • surrounding type
  • text font
  • font size
  • any graphic you use
  • product descriptions
  • sales copy
  • verbiage
  • different offers (50% off, 35% off, free sample, etc.)
  • a whole page
  • a whole landing page

Then you set your control element/variable, which is essentially what you have now, and call it A. The element/variable you want to test is B, run simultaneously with the control (Christian, 2012; Patel, n.d.). The A and B variables are also known as variants; the challenger is the B variant, and the champion is the variant that outperforms the others (unbounce.com, n.d.).  For instance, the audience can be split so that 50% of visitors see variant A and the other 50% see variant B. The split can vary from 50/50 to 60/40 to 70/30, etc., depending on how much weight you want to assign to the challenger variant (unbounce.com, n.d.).

Another thing to consider is what statistical test you want to apply to the A/B test:
  • If Gaussian is the assumed distribution (i.e. average revenue per paying user), you can use the Unpaired T-test and/or Student T-test (Amazon, 2015; Box et al., 1987; Pereira, 2007).
  • If Binomial is the assumed distribution (i.e., click-through rate), you can use Fisher’s exact test and/or Bernard’s test (Amazon, 2015).
  • If Poisson is the assumed distribution (i.e. transactions per paying user), you can use the E-test and/or C-test (Krishnamoorthy & Thomson, 2004).
  • If Multinomial is the assumed distribution (i.e. the number of each product purchased), you can use the Chi-square test.
  • If the assumed distribution is unknown, you can use the Mann-Whitney U test and/or Gibbs Sampling.
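As a concrete illustration of the binomial (click-through) case, below is a minimal Ada sketch of a two-proportion z-test, a large-sample normal approximation often used in place of Fisher’s exact test; the visitor and click counts, and the choice of the z-test itself, are illustrative assumptions:

with Ada.Text_IO;
with Ada.Numerics.Elementary_Functions;

procedure AB_Z_Test is
   use Ada.Numerics.Elementary_Functions;

   --  Made-up counts: variant A got 320 clicks out of 4000 visitors,
   --  variant B got 368 clicks out of 4000 visitors.
   N_A : constant Float := 4000.0;   X_A : constant Float := 320.0;
   N_B : constant Float := 4000.0;   X_B : constant Float := 368.0;

   P_A    : constant Float := X_A / N_A;                 -- CTR of A
   P_B    : constant Float := X_B / N_B;                 -- CTR of B
   Pooled : constant Float := (X_A + X_B) / (N_A + N_B); -- pooled CTR
   SE     : constant Float :=
     Sqrt (Pooled * (1.0 - Pooled) * (1.0 / N_A + 1.0 / N_B));
   Z      : constant Float := (P_B - P_A) / SE;

begin
   Ada.Text_IO.Put_Line ("z =" & Float'Image (Z));
   --  |z| > 1.96 would reject equal click-through rates at the 5% level.
end AB_Z_Test;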
Testing can go on for a few days to a few weeks, depending on the amount of traffic you get (Patel, n.d.).  A site like Facebook can start an A/B test on Monday and have a reportable result by Friday, whereas the current state of this blog might require a month or two. However, running a test too long, or on the wrong set of traffic, can introduce confounding variables that skew your results.
To test three variations, according to Patel (n.d.), you need to set up an A/B test, a B/C test, and a C/A test, and you want to give the test a bit more time to gather enough data.  When testing multiple variants, you want to give them equal weight in order to pick a champion variant quickly (unbounce.com, n.d.).
Another great article is from Kolowich (n.d.), which provides a checklist for successfully conducting an A/B test.

Adv DBs: Data Abstractions

Data Abstraction

Text can be abstracted into information and knowledge through either hard clustering, where a word has only one connection, or soft clustering, where a word can have multiple connections to other words (Kulkarni & Kinariwala, 2013).  Clustering, in general, is grouping things together by similar characteristics.  It is hard to apply hard clustering to the sentences of a paragraph, or to prose generally, because sentences are interconnected with those above and below them; clusters within prose can also overlap with each other.  Thus, the authors propose that soft clustering be used for the analysis of sentences within prose.  The method proposed by Kulkarni and Kinariwala (2013) is PageRank, used to score the importance of sentences within a document (and thus help summarize it). The weakness of this paper is that the authors propose an idea without testing it: they did not write any code or analyze any data set to show whether their hypothesis about PageRank was correct.  It is, for now, a thought experiment.  Its strength, if confirmed by other studies, is that it maps out the limitations and strengths of hard and soft clustering for data mining within prose and between works of a similar nature.

Management issues in systems development

Information is of great value to humanitarian efforts in accomplishing their missions.  Van de Walle and Comes (2015) state that the United Nations has delivered methods for humanitarian information management revolving around checking, sharing, and using data.  Checking data revolves around reliability and verifiability; sharing data revolves around interoperability (data formats), accessibility, and sustainability; and using data deals with timeliness and relevance.  After interviewing humanitarians from two different disaster scenarios, Syria and Typhoon Haiyan, for about 1-1.5 hours each, the authors concluded that standard processes can be followed for natural disasters like a landfalling hurricane, but that standard processes lend themselves to inflexibility and do not meet every intricate need.  In a more complicated relief effort like Syria’s, confidentiality and unreliable data sources (sometimes arriving under the table, like in an old spy movie) affected the entire process.  Given the small sample size of two events and the few humanitarians interviewed, further research is definitely needed before generalizations can be made about developing information management systems for natural versus geopolitical disasters.  The main strength of this paper is its analysis of information management in disasters with respect to the standards imposed by the UN; it also illustrates that information management is end-to-end.  My research hopes to help improve pre-disaster conditions, whereas their research covers post-disaster aid.  For the same disaster, a landfalling hurricane, the key information needed to carry out the respective tasks changes: hurricane wind speeds are no longer needed after the storm has passed over a city and left a wake of destruction, and the death toll is not important before the hurricane makes landfall.  But we need wind speeds to improve forecasts and mitigate death tolls, and we need the current death toll to keep it from rising after the disaster has struck.

References

  • Van de Walle, B., & Comes, T. (2015). On the nature of information management in complex and natural disasters. Procedia Engineering, 403-411.
  • Kulkarni, B. M., & Kinariwala, S. A. (2013). Review on fuzzy approach to sentence level text clustering. International Journal of Scientific Research and Education, 3845-3850.

Adv DBs: A possible future project?

Below is a possible future research paper on a database related subject.

Title: Using MapReduce to aid in clinical test utilization patterns in medicine

The motivation:

Efficient processing and analysis of clinical data could aid in better clinical testing of patients, and MapReduce solutions allow for an integrated solution in the medical field, saving resources when it comes to moving data in and out of storage.

The problem statement (symptom and root cause)

The rates of Sexually Transmitted Infections (STIs) are increasing at an alarming pace. Could loading the test utilization patterns of the Roper Saint Francis Clinical Network in the South into Hadoop with MapReduce reveal patterns in the current STI population and predict areas where an outbreak may be imminent?

The hypothesis statement (propose a solution and address the root cause)

H0: Data mining in Hadoop with MapReduce will not be able to identify any meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

H1: Data mining in Hadoop with MapReduce can identify a meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

The research questions

Could this study of STI outbreak rates be generalized to other disease outbreak rates?

Is this application of data-mining in Hadoop with MapReduce the correct way to analyze the data?

The professional significance statement (new contribution to the body of knowledge)

Identifying where an outbreak of any disease (or STI) will occur via clinical test utilization patterns has yet to be done, according to Mohammed et al. (2014), who state that Hadoop with MapReduce is a great tool for clinical work because it has been adopted in similar fields of medicine like bioinformatics.
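Production MapReduce jobs on Hadoop are typically written in Java, but to keep one language across these posts, below is a toy, single-machine Ada sketch of the idea behind the proposed analysis: a map phase that emits a (location, 1) pair per clinical test record, and a reduce phase that sums the counts per location.  All names and records are hypothetical, not Roper Saint Francis data:

with Ada.Text_IO;
with Ada.Containers.Indefinite_Ordered_Maps;

procedure Test_Pattern_Counts is

   subtype Zip is String (1 .. 5);

   --  Made-up clinical test records, reduced to the ZIP code of the
   --  ordering clinic (hypothetical data).
   Records : constant array (1 .. 6) of Zip :=
     ("29401", "29403", "29401", "29407", "29401", "29403");

   package Count_Maps is new Ada.Containers.Indefinite_Ordered_Maps
     (Key_Type => String, Element_Type => Natural);
   use Count_Maps;

   Counts : Map;

begin
   --  "Map" phase: each record emits the pair (zip, 1).
   --  "Reduce" phase: counts are summed per key. The two phases are fused
   --  here, since this sketch runs on one machine rather than a cluster.
   for R of Records loop
      if Counts.Contains (R) then
         Counts.Replace (R, Counts.Element (R) + 1);
      else
         Counts.Insert (R, 1);
      end if;
   end loop;

   for C in Counts.Iterate loop
      Ada.Text_IO.Put_Line (Key (C) & " =>" & Natural'Image (Element (C)));
   end loop;
end Test_Pattern_Counts;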

Resources

  • Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: Current landscape and future trends. BioData Mining, 7. doi:10.1186/1756-0381-7-22
  • Pokorny, J. (2011). NoSQL databases: A step to database scalability in web environment. In iiWAS ’11: Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (pp. 278-283).