Parallel Programming: Vector Clocks

Nodes can act together as a group: a message multicast to the group is received by every node in that group (Sandén, 2011).  In a broadcast, by contrast, all nodes in the system get the same message.  A multicast can deliver messages to the nodes in a group under different ordering guarantees: First In First Out (FIFO) order, causal order, or total order (called atomic multicast if it is also reliable).

Per Sandén (2011), a multicast can occur if the source is a member of the group, but it cannot span across groups in causal order. Two-phase, total-order multicast systems can look like vector clocks, but they are not: each message send or receive increments the logical time on a system by one as the systems talk to each other.

Below is an example of a vector clock:

To G1

  • m1 suggested time at 6 and  m2 suggested time at 11
  • m1 commit time at 7 and m2 commit time at 12

To G2

  • m1 suggested time at 7 and  m2 suggested time at 12
  • m1 commit time at 8 and m2 commit time at 13
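
For comparison with the suggested/commit times above, here is a minimal sketch of a standard vector clock. It is not taken from Sandén (2011); the class name `VectorClock`, the method names, and the helper `happened_before` are assumptions made for illustration, and the update rule is the usual textbook one (increment your own entry on a send, take the element-wise maximum and then increment on a receive).

```python
class VectorClock:
    """One node's view of logical time across a fixed set of nodes (illustrative)."""

    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes

    def tick(self):
        # A local event advances only this node's own entry.
        self.clock[self.node_id] += 1

    def send(self):
        # Sending counts as an event; the message carries a copy of the clock.
        self.tick()
        return list(self.clock)

    def receive(self, incoming):
        # Merge by element-wise maximum, then count the receive as an event.
        self.clock = [max(a, b) for a, b in zip(self.clock, incoming)]
        self.tick()


def happened_before(a, b):
    """True if timestamp a causally precedes timestamp b."""
    return all(x <= y for x, y in zip(a, b)) and a != b
```

Two timestamps for which `happened_before` is false in both directions are concurrent, which is exactly the case a single counter per message cannot express.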

Reference

Parallel Programming: State diagram of Maekawa’s voting algorithm

Sandén (2011) defines state diagrams as a way to show the possible states an object could be in.  He also defines events as action verbs that label an arrow between two states (if an action doesn’t change the state, it can be listed inside the state).  An action can also have conditions attached to it.

Thus, a state diagram shows the transition from state to state as events occur.  An event usually has many occurrences, and they are instantaneous.  An activity, by contrast, is an operation that takes time, and it has the keyword “do /”.  Finally, a super-state can encompass multiple states (Sandén, 2011).

The goal of this post was to make a state diagram of Maekawa’s voting algorithm, as described under “Maekawa’s algorithm” within the “Distributed mutual exclusion” set. This can be done in various ways. One option is the following 5 states:

  • Released and not voted
  • Released and voted
  • Wanted and not voted
  • Wanted and voted
  • Held and voted (for self)

Events are:

  • request_received, etc., for messages arriving from other nodes
  • acquire for when the local node wants the lock
  • release for when the local node gives up the lock.

A possible solution is shown below:

[Figure: state diagram of Maekawa’s voting algorithm]
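
The five states and the events listed above can also be written down as a transition table. The sketch below is one possible reading, not a transcription of the diagram: every transition in `TRANSITIONS` is an assumption, the event names `release_received` and `votes_granted` are placeholders standing in for the "etc." in the event list, and the queueing of deferred requests is left out entirely.

```python
from enum import Enum


class State(Enum):
    RELEASED_NOT_VOTED = "released, not voted"
    RELEASED_VOTED = "released, voted"
    WANTED_NOT_VOTED = "wanted, not voted"
    WANTED_VOTED = "wanted, voted"
    HELD_VOTED = "held, voted (for self)"


# (current state, event) -> next state. Assumed transitions for illustration only.
TRANSITIONS = {
    (State.RELEASED_NOT_VOTED, "request_received"): State.RELEASED_VOTED,
    (State.RELEASED_VOTED, "release_received"): State.RELEASED_NOT_VOTED,
    (State.RELEASED_NOT_VOTED, "acquire"): State.WANTED_NOT_VOTED,
    (State.RELEASED_VOTED, "acquire"): State.WANTED_VOTED,
    (State.WANTED_NOT_VOTED, "request_received"): State.WANTED_VOTED,
    (State.WANTED_VOTED, "release_received"): State.WANTED_NOT_VOTED,
    (State.WANTED_VOTED, "votes_granted"): State.HELD_VOTED,
    (State.HELD_VOTED, "release"): State.RELEASED_NOT_VOTED,
}


def step(state, event):
    # Events with no entry for the current state leave the state unchanged.
    return TRANSITIONS.get((state, event), state)


def run(events, start=State.RELEASED_NOT_VOTED):
    # Feed a sequence of events through the table and return the final state.
    state = start
    for event in events:
        state = step(state, event)
    return state
```

For example, `run(["acquire", "request_received", "votes_granted"])` ends in the held state under these assumed transitions, and a following `release` returns the node to released and not voted.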

Reference

Qualitative Analysis: Coding Project Report of a Virtual Interview Question

The virtual interview question: Explain what being a doctoral student means for you? How has your life changed since starting your doctoral journey?

Description of your coding process

The steps I followed in this coding process were to read the responses once, at least one week before this individual project assignment was due.  This allowed me to think of generic themes and codes at a very high level throughout the week.  After the week was over, I went to wordle.net to create a word cloud of the top 50 most used words in this virtual interview, with the results shown below.

Figure 1: Screenshot of the wordle.net results, which were used to help develop codes and sub-codes.  Words that appear bigger occur more often in the virtual interview than words that appear smaller.
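
The counting behind a word cloud like Figure 1 can be approximated in a few lines. This is only a sketch of the idea, not how wordle.net works internally; the function name `top_words` and the short `STOP_WORDS` set are assumptions, and a real analysis would use a fuller stop-word list.

```python
import re
from collections import Counter

# A deliberately small, illustrative stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "for", "my", "i"}


def top_words(text, n=50):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)
```

Calling `top_words(interview_text, 50)` on the full transcript (here `interview_text` is a hypothetical string holding the responses) gives a ranked list that can be compared against the largest words in the cloud.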

The most telling themes from Figure 1 are: Time, Family, Life, Work, Student, Learning, First, Opportunity, Research, People, etc.  This helped create some codes and some of the sub-codes, like prioritization, for family, etc.  Figure 1 also helped me confirm the ideas for codes that I had been thinking about for the past week, so I felt ready to begin coding.  After deciding on the initial set of codes, I did some manual coding while asking the questions: “What is the person saying? And how they are saying it? And could there be a double meaning in the sentences?”  The last question helped me identify whether each sentence in this virtual interview had multiple codes within it.  I used QDA Miner Lite as my software of choice for coding; it is a free product, and there are plenty of end-user tutorials on YouTube, made by researchers from many fields, on how to use this software effectively.  After the initial manual coding, I revisited the initial codebook.  Some of the subcodes that fell under betterment were moved into the future code, as they fit that theme better than pure betterment.  This reanalysis went on for all codes.  As I re-read the responses for the third time, some new subcodes were added as well.  The reason for re-reading this virtual interview a third time was to make sure that few, if any, other codes could be created or were missing.

Topical Coding Scheme (Code Book)

The codebook that was derived is as follows:

  • Family
    • For Family
    • Started by Family
    • First in the Family
  • Perseverance
    • Exhausted
    • Pushing through
    • Life Challenges
    • Drive/Motivation
    • Goals
  • Betterment
    • Upgrade skills
    • Personal Growth
    • Maturity
    • Understanding
    • Priority Reanalysis
  • Future
    • More rewarding
    • Better Life
    • Foresight
  • Proving something
    • To others

 

Diagram of findings

Below are images developed through the analytical/automated part of QDA Miner Lite:

Figure 2: Distribution of codes in percentages throughout the virtual interview.

Figure 3: Distribution of codes in frequency throughout the virtual interview.

Figure 4: Distribution of codes in frequency throughout the virtual interview in terms of a word cloud where more frequent codes appear bigger than less frequent codes.

Brief narrative summary of finding referring to your graphic diagram

Given Figures 2-4, one could say that the biggest themes for going into the doctoral program are the prospect of a better life and the hope of changing the world, as these codes showed up most frequently in the interview.  One student states that their degree would open many doors: “Pursuing and obtaining this level of degree would help to open doors that I may not be able to walk through otherwise.”  Another student says that their research will hopefully change the future lives of many: “The research that I am going to do will hopefully allow people to truly pursue after their dreams in this ever-changing age, and let the imagination of what is possible within the business world be the limit.”  Other students are a bit more practical in their responses, stating things like “…move up in my organization and make contributions to the existing knowledge” and, finally, “More opportunities open for you as well as more responsibility for being credible and usefulness as a cog in the system.”

Another concept that kept repeating here is that this is done for family, and that because of family, work, and school, the life of a doctoral student in this class has to be reprioritized (hence the code priority reanalysis).  This is seen primarily in that all forms of graphical output show these two as the most significant things that drive students towards the degree.  One student went to one extreme, “Excluding family and school members, I am void of the three ‘Ps’ (NO – people, pets, or plants). I quit my full-time job and will be having the TV signal turned off after the Super Bowl to force additional focus.”  Another student said that time was the most important thing they had and that it has changed significantly, “The most tangible thing that has changed in my life since I became a doctoral student has been my schedule.  Since this term began I have put myself on a strict schedule designating specific time for studies, my wife, and time for myself.”  Finally, another student says balance is key for them: “Having to balance family time, work, school, and other social responsibilities, has been another adjusted change while on this educational journey. The support of my family has been very instrumental in helping me to succeed and the journey has been a great experience thus far.”  There are 7 instances in which these two codes overlap or are included within each other, which apparently happens 80% of the time.

Thus, from this virtual interview, I am able to conclude that family is mentioned with priority reanalysis in order to meet the goal of the doctoral degree, and that time management, a component of priority reanalysis, is key.  Some students take this reanalysis to the extreme, as mentioned above, but if they feel that is the only way they can accomplish this degree in a timely manner, then who am I to judge?  After all, it is the job of the researcher, when coding, to be unbiased.  Although family could drive people to complete the degree, it is the prospect of a better life and of changing the world for the better that was mentioned most.

Appendix A

An output file from qualitative software can be generated by using QDA Miner Lite.

 

LSAT Conditionals and CS Conditionals

In the past few months, I have been studying for the LSAT exam. Yes, I am contemplating Law School.  Law school will be a topic for another day.  However, I came across a few points that are extremely interesting and could spark discussion in the computer science field.  In the field of computer science, we have conditional statements in our coding languages.  One of the most common is the IF-THEN statement, which is one of many conditional phrases. However, the LSAT has made me realize that there is more to the IF-THEN conditional statement, and here is why (Teti et al., 2013):
  1. If X then Y (Simple IF-THEN statement)
  2. If not Y then not X (This is the contrapositive of 1)
  3. X if and only if Y means if X then Y, and if Y then X
  4. X Unless Y means if not X then Y
where X is the sufficient variable and Y is the necessary variable. The word “If” can be replaced by “All,” “Any,” “Every,” and “When” (Teti et al., 2013), whereas the word “then” can be replaced by “only” or “only if.” Remember that a conditional phrase like the ones above introduces a relationship between the variables, but it doesn’t establish anything concrete. A sufficient variable (X) is enough to guarantee Y, but Y is not enough on its own to guarantee X.
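
These equivalences can be checked mechanically with a truth table. The sketch below is only an illustration: the helper names `implies` and `equivalent` are assumptions, "unless" is read as "if not X then Y" as in item 4, and the biconditional is read as implication in both directions.

```python
from itertools import product


def implies(x, y):
    # "If x then y" is false only when x is true and y is false.
    return (not x) or y


def equivalent(f, g):
    # Two statements are equivalent if they agree on every truth assignment.
    return all(f(x, y) == g(x, y) for x, y in product([False, True], repeat=2))


if_then = lambda x, y: implies(x, y)
contrapositive = lambda x, y: implies(not y, not x)
converse = lambda x, y: implies(y, x)
if_and_only_if = lambda x, y: implies(x, y) and implies(y, x)
unless = lambda x, y: implies(not x, y)

print(equivalent(if_then, contrapositive))  # True: the contrapositive says the same thing
print(equivalent(if_then, converse))        # False: Y alone does not guarantee X
```

The second result is the sufficient/necessary point in code form: swapping X and Y changes the meaning, while negating and swapping them does not.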
Subsequently, with any conditional, we have to look at conjunctive “and” or disjunctive “or” statements.
  1. Both X and Y = X + Y
  2. Either X or Y = X or Y
  3. Not both X and Y = not X or not Y
  4. Neither X nor Y = not X + not Y
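
The four phrases map directly onto boolean operators. The following is a small sketch with assumed function names (`both`, `either`, `not_both`, `neither`, `truth_table`); it reads "either ... or" as the inclusive "or" and uses De Morgan's laws for the two negated forms.

```python
from itertools import product


def both(x, y):
    return x and y


def either(x, y):
    # Inclusive "or": true when at least one of x, y is true.
    return x or y


def not_both(x, y):
    # Same as (not x) or (not y) by De Morgan's law.
    return not (x and y)


def neither(x, y):
    # Same as (not x) and (not y) by De Morgan's law.
    return not (x or y)


def truth_table(f):
    # Rows in the order (F, F), (F, T), (T, F), (T, T).
    return [f(x, y) for x, y in product([False, True], repeat=2)]
```

Comparing `truth_table(not_both)` with `truth_table(neither)` shows that the two phrases differ on the rows where exactly one of X and Y is true.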

We should note that an “or” statement can also allow for the possibility of both (Teti et al., 2013). Additionally, the LSAT adds some nuance to the conditional phrase by adding an “EXCEPT” clause.  For instance (Teti et al., 2013):

  1. Must be true EXCEPT = Could be false
  2. Could be true EXCEPT = Must be false
  3. Could be false EXCEPT = Must be true
  4. Must be false EXCEPT = Could be true
The LSAT views these conditional, conjunctive, and disjunctive phrases with a bit more nuance than computer scientists do, and maybe we can bring some of this nuance into future coding to get more nuanced code and results.
Some people may say that this whole post is overkill and ask why we have to look into such nuance. However, each one of the above bullets is necessary and has value. Each has entered the lexicon for a particular reason. We can easily decompose each of these and then map them out in simpler terms with a programming language. However, to sufficiently capture the nuanced characteristics of these conditional phrases, we can end up creating really nasty pieces of convoluted code.
Resources:
  • Teti, T., Teti, J., and Riley, M. (2013). The Blueprint for LSAT Logic Games. Blueprint LSAT Preparation.

A/B Testing

Are you a HiPPO? HiPPO stands for the Highest-Paid Person’s Opinion and refers to the person who designs websites or gives their opinion on how things ought to be (Christian, 2012). This may not be a good thing, because the HiPPO may not be the best person to get the maximum traffic to and through your website. Proponents of A/B testing state that its advantages outweigh the time it takes to conduct it in the first place (Christian, 2012; Patel, n.d.). Patel (n.d.) further claims that “A/B tests, done consistently, can improve your bottom line substantially.”  Therefore, A/B testing helps the data scientist narrow down which element/variable makes an effective difference towards their goal, e.g. the click-through rate within a website to generate more revenue (Christian, 2012; Patel, n.d.; Unbounce.com, n.d.).
First, you need to know which elements/variables you want to test. It is key to know what you want to test, the current baseline result, what you are testing for, and the goal you want to reach (Patel, n.d.). If you have a click funnel for your audience and they are dropping out at a certain level, you may want to focus on that area to reduce fallout rates.  Once you know what to test, make a list of all the variables you would like to test (Christian, 2012; Patel, n.d.; Unbounce.com, n.d.):
  • location
  • color
  • button type
  • surrounding type
  • text font
  • font size
  • any graphic you use
  • product descriptions
  • sales copy
  • verbiage
  • different offers (50% off, 35% off, free sample, etc.)
  • a whole page
  • a whole landing page

Then you set your control element/variable, which is essentially what you have now, and call it your A. The element/variable you want to test is your B, and it is run simultaneously with the control (Christian, 2012; Patel, n.d.). The A and B variables are also known as variants; the challenger is the B variable, and the champion is the variant that outperforms the others (Unbounce.com, n.d.).  For instance, 100% of the audience can be split so that 50% see your site with variable A and the other 50% see your site with variable B. The split can vary from 50/50 to 60/40 to 70/30, etc., depending on how much weight you want to assign to the challenger variable (Unbounce.com, n.d.).
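
One common way to implement such a split is to hash a stable visitor identifier into a number between 0 and 1 and compare it with the challenger's share of traffic. The sketch below is an assumption about how this could be done, not a method described in the cited sources; the function name `assign_variant` and the parameter `challenger_share` are made up for illustration.

```python
import hashlib


def assign_variant(user_id: str, challenger_share: float = 0.5) -> str:
    """Deterministically place a visitor in variant "A" (control) or "B" (challenger)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < challenger_share else "A"
```

Because the assignment depends only on the identifier, a returning visitor keeps seeing the same variant, and a 70/30 split is just `assign_variant(user_id, challenger_share=0.3)`.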

Another thing to consider is what statistical test you want to apply to the A/B test:
  • If Gaussian is the assumed distribution (e.g. average revenue per paying user), you can use the Unpaired T-test and/or Student T-test (Amazon, 2015; Box et al., 1987; Pereira, 2007).
  • If Binomial is the assumed distribution (e.g. click-through rate), you can use Fisher’s exact test and/or Barnard’s test (Amazon, 2015).
  • If Poisson is the assumed distribution (e.g. transactions per paying user), you can use the E-test and/or C-test (Krishnamoorthy & Thomson, 2004).
  • If Multinomial is the assumed distribution (e.g. the number of each product purchased), you can use the Chi-square test.
  • If the assumed distribution is unknown, you can use the Mann-Whitney U test and/or Gibbs Sampling.
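
As an illustration of the binomial case, here is a minimal standard-library sketch of a two-sided Fisher's exact test on a 2x2 table of clicks and non-clicks. The function name and argument layout are assumptions; in practice a statistics library would normally be used instead of hand-rolled code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]].

    For an A/B test: a, b = clicks and non-clicks for variant A;
    c, d = clicks and non-clicks for variant B.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of seeing x clicks in row 1 with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = table_probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(
        table_probability(x)
        for x in range(low, high + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )
```

A small p-value suggests that the difference in click-through rate between A and B is unlikely to be due to chance alone; the threshold (for example 0.05) should be chosen before the test starts.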
Testing can go on for a few days to a few weeks, depending on the amount of traffic you get (Patel, n.d.).  Something like Facebook can start an A/B test on Monday and have a reported result by Friday, whereas a test on this blog in its current state may take about a month or two. However, a test that runs too long, or runs on the wrong set of traffic, can pick up confounding variables, which will skew your results.
According to Patel (n.d.), to test three variations (also known as a multivariate test), you need to set up an A/B test, a B/C test, and a C/A test, and you want to give it a bit more time to have enough data.  When doing a multivariate test, you want to give the variations equal weight to pick a champion variable quickly (Unbounce.com, n.d.).
Another great article to view is from Kolowich (n.d.), which provides a checklist for successfully conducting an A/B test.
Resources

Adv DBs: Data Abstractions

Data Abstraction

Text can be abstracted for information and knowledge through either hard clustering, where a word has only one connection, or soft clustering, where a word can have multiple connections to other words (Kulkarni & Kinariwala, 2013).  Clustering, in general, is grouping together things with similar characteristics.  It is difficult to apply hard clustering to the sentences of a paragraph or even of prose, because each sentence is interconnected with the sentences above and below it.  Also, clusters within prose can overlap with each other.  Thus, it is proposed that soft clustering be used for the analysis of sentences within prose.  The method proposed in Kulkarni & Kinariwala (2013) is PageRank, which is used to show the importance of a sentence or sentences within a document (thus helping to summarize the document). The weakness of this paper lies in the fact that they propose an idea without testing it.  They didn’t develop any code or analyze any data set to say whether their hypothesis on PageRank was correct.  Thus, this was a wonderful thought experiment.  The strength, if proven correct by other studies, is that it maps out the limitations and strengths of hard and soft clustering in data mining within prose and between prose of a similar nature.
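
To make the PageRank idea concrete, here is a small sketch that scores sentences by running PageRank over a sentence-similarity graph. It is not the procedure from Kulkarni & Kinariwala (2013): the Jaccard word-overlap similarity, the damping factor of 0.85, and all function names are assumptions chosen to keep the example short.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration on a weighted graph; weights[i][j] is the link from i to j."""
    n = len(weights)
    ranks = [1.0 / n] * n
    out_totals = [sum(row) for row in weights]
    for _ in range(iterations):
        new_ranks = []
        for j in range(n):
            # Each node i passes on its rank in proportion to its link weight to j.
            incoming = sum(
                ranks[i] * weights[i][j] / out_totals[i]
                for i in range(n)
                if out_totals[i] > 0
            )
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks
    return ranks


def jaccard(sentence_a, sentence_b):
    # Word-overlap similarity; an assumed stand-in for a real similarity measure.
    words_a, words_b = set(sentence_a.lower().split()), set(sentence_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def rank_sentences(sentences):
    """Return one importance score per sentence (higher means more central)."""
    n = len(sentences)
    weights = [
        [jaccard(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    return pagerank(weights)
```

Sentences that share vocabulary with many other sentences end up with higher scores, which is the sense in which PageRank can point to the sentences that best summarize a document.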

Management issues in systems development

Information is seen as being of great value to humanitarian efforts in accomplishing their missions.  Van de Walle & Comes (2015) state that the United Nations has delivered methods for humanitarian Information Management revolving around the checking, sharing, and use of data.  Checking data revolves around reliability and verifiability; sharing data revolves around interoperability (data formats), accessibility, and sustainability; and the use of data deals with timeliness and relevance.  After interviewing humanitarians in two different disaster scenarios, Syria and Typhoon Haiyan, for about 1-1.5 hours, they were able to conclude that standard processes can be followed for natural disasters like a landfalling hurricane.  However, standard processes lend themselves to inflexibility and to not meeting all the intricate needs. In a more complicated relief effort like the one in Syria, confidentiality and unreliable data sources (sometimes arriving as in an old spy movie, under the table, etc.) affected the entire process.  Finally, the small sample size of two events and of humanitarians interviewed suggests that further research is definitely needed before generalizations about developing Information Management systems for natural disasters and geopolitical disasters can be made. The main strength of this paper is its breakdown of the information management of disasters with respect to the standards imposed by the UN.  It also illustrates that information management is end-to-end.  My research hopes to help improve pre-disaster conditions, while their research covers post-disaster aid.  For the same disaster, a landfalling hurricane, the key information needed to carry out these respective tasks changes.  In other words, hurricane wind speeds are no longer needed after the storm has passed over a city and left a wake of destruction, and the death toll is not important before the hurricane makes landfall.  But we need wind speeds to improve forecasts and mitigate death tolls, and we need the current death toll to make sure we can keep it from rising after the disaster has struck.

References

  • Van de Walle, B. & Comes, T. (2015) On the Nature of Information Management in Complex and Natural Disasters. Procedia Engineering, Pages 403-411.
  • Kulkarni, B. M., & Kinariwala, S. A. (2013). Review on Fuzzy Approach to Sentence Level Text Clustering. International Journal of Scientific Research and Education. Pages 3845-3850.

Adv DBs: A possible future project?

Below is a possible future research paper on a database related subject.

Title: Using MapReduce to aid in clinical test utilization patterns in medicine

The motivation:

Efficient processing and analysis of clinical data could aid in better clinical tests on patients, and MapReduce solutions allow for an integrated solution in the medical field, which aids in saving resources when it comes to moving data in and out of storage.

The problem statement (symptom and root cause)

The rates of Sexually Transmitted Infections (STIs) are increasing at an alarming pace. Could adding the test utilization patterns of the Roper Saint Francis Clinical Network in the South into Hadoop with MapReduce reveal patterns in the current STI population and predict areas where an outbreak may be imminent?

The hypothesis statement (propose a solution and address the root cause)

H0: Data mining in Hadoop with MapReduce will not be able to identify any meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

H1: Data mining in Hadoop with MapReduce can identify a meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.
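
To show the shape of the computation such a study would rely on, here is a toy, single-machine sketch of the map, shuffle, and reduce steps that counts tests ordered per region. It does not use Hadoop, and the record layout `(region, test_name)` and all function names are hypothetical; real clinical records would look different and would need de-identification.

```python
from collections import defaultdict


def map_phase(records):
    # Emit one (key, 1) pair per test order; the key is (region, test_name).
    for region, test_name in records:
        yield (region, test_name), 1


def shuffle(pairs):
    # Group all values that share a key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the counts for each key.
    return {key: sum(values) for key, values in grouped.items()}


def count_tests(records):
    return reduce_phase(shuffle(map_phase(records)))
```

Counts like these, broken down further by time period, would be the raw utilization pattern in which an unusual rise in one region could be looked for.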

The research questions

Could this study of STI outbreak rates be generalized to other disease outbreak rates?

Is this application of data-mining in Hadoop with MapReduce the correct way to analyze the data?

The professional significance statement (new contribution to the body of knowledge)

Identifying where an outbreak of any disease (or STI) may occur via clinical test utilization patterns has yet to be done, according to Mohammed et al. (2014), who state that Hadoop with MapReduce is a great tool for clinical work because it has been adopted in similar fields of medicine like bioinformatics.

Resources

  • Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: Current landscape and future trends. Biodata Mining, 7. doi:http://dx.doi.org/10.1186/1756-0381-7-22 – Doctoral Library Advanced Technologies & Aerospace Collection
  • Pokorny, J. (2011). NoSQL databases: A step to database scalability in web environment. In iiWAS ’11 Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (pp. 278-283). – Doctoral Library ACM Digital Library

Qualitative Methods: Questions distinctions

Qualitative Research Questions distinctions

Usually, qualitative research methods start off with an open-ended central question or two, using the words “what” or “how” about a single phenomenon or concept, in order to suggest an exploratory design.  The rest of the question uses exploratory verbs (report, describe, discover, seek, explore, etc.) in a non-directional manner so as not to suggest causation.  The research question could also be asked in a way that suggests which qualitative research tool you will use to analyze the data, e.g. using the word “opinions” could mean interviews. Finally, the research question could include the key defining features of the participants in the study (teens, women, men, veterans, people with disabilities, etc.) (Creswell, 2014).

Central research question:

So, an example of a central question could come from a follow-on study on the results of my doctoral research:

What are the opinions on the results from the use of text analytics on tropical discussions to discover weather constructs that positively and negatively affect hurricane forecast skill, as perceived and used by the hurricane specialists at the National Hurricane Center?

References:

Qualitative Research: Sampling

Purposive and Theoretical Sampling:

When identifying means for recording data in qualitative research, one must also be careful about how the data are collected: it can be via unstructured or semi-structured observations and interviews, documents, and visual materials (Creswell, 2014).  Purposeful sampling helps select the (1) actors, (2) events, (3) setting, and (4) process that will best allow the researcher to get a firm grasp of, and address, the central questions and sub-questions of their study.  Also, consider how many sites and participants there should be in the study (identifying your sample size).  The sample size can vary from 1-2 in narrative research, 3-10 in grounded theory, 20-30 for ethnographic studies, and 4-5 cases in case studies (Creswell, 2014).

However, you can reach data saturation (when the researcher stops collecting data because no new information would reveal any other insights or properties addressing the research question) before reaching any of these numbers (Creswell, 2014). Theoretical sampling is bound around a theoretical concept, and this type of sampling relies more on the idea of data saturation.  Thus, in theoretical sampling the researcher keeps collecting and trying to understand the data, in order to define or understand their theory, until data saturation is reached rather than until a defined number is reached.

Example:

An example of this could come from studying the effects of business decisions on the family by analyzing relocation decisions (PROCESS) for non-military families.  In this example, I would like to purposefully sample three groups of families (ACTORS): those with no children, those with children that are no older than 12 years of age, and those with one or more children over the age of 13.  I want to see if there is a difference between the reactions based on having kids, and on having kids that are older versus younger, over the past decade (EVENT) at Boeing (SETTING).  I could aim for 20-30 families per group for a total sample size of 60-90, or I could aim for data saturation within each of these groups (theoretical sampling).  If I want to stick with 60-90 as a total sample size, I could aim for an open-answer survey or conduct interviews (which is more costly on my end).  If I wish to aim for data saturation, it can be more easily done with interviews.

References:

Adv DBs: Unsupervised and Supervised Learning

Unsupervised and Supervised Learning:

Supervised learning is a type of machine learning in which, given a set of data points with known classifications or values, we choose a function that maps each data point to its classification or value.  Eventually, you will get new data points for which the classification or value is not given, and the machine has to apply the learned function to predict it. There are two main types of supervised learning: Classification (which has a finite set of outcomes, e.g. based on a person’s chromosomes in a database, their biological gender is either male or female) and Regression (which represents real numbers in the real space or n-dimensional real space).  In regression, you can have a 2-dimensional real space with training data that gives you a regression formula with a Pearson’s correlation coefficient r; given a new data point, the machine can use the regression formula to predict where that data point will fall in the 2-dimensional real space (Mathematicalmonk’s channel, 2011a).
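
The regression case can be sketched with ordinary least squares on a handful of training points. The code below is a minimal illustration with assumed function names (`fit_line`, `predict`); it fits a straight line, reports Pearson's r for the training data, and then predicts the value for a new point.

```python
from math import sqrt


def fit_line(xs, ys):
    """Least-squares line through the training points, plus Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


def predict(x, slope, intercept):
    # Apply the learned function to a data point whose value is not known.
    return slope * x + intercept
```

The training step is `fit_line`; the supervised part is that every training x comes with a known y, and `predict` is only as trustworthy as the correlation r suggests.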

Unsupervised learning aims to uncover homogeneous subpopulations in databases (Connolly & Begg, 2015). In unsupervised learning, you are given data points (values, documents, strings, etc.) in n-dimensional real space, and the machine looks for patterns through clustering, density estimation, dimensional reduction, etc.  For clustering, one could take the data points and place them in bins with common properties, which are sometimes unknown to the end-user due to the vast size of the data within the database.  With density estimation, the machine is fed a set of probability density functions to fit the data, and it estimates the density of that data set.  Finally, for dimensional reduction, the machine will find some lower-dimensional space in which the data can be represented (Mathematicalmonk’s channel, 2011b).  Dimensional reduction, however, can destroy structure that is visible in the higher-order dimensions.
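
Clustering can be illustrated with a tiny k-means loop on one-dimensional data. This is a sketch under simplifying assumptions: the function name `kmeans_1d` is made up, the starting centers are passed in by the caller rather than chosen at random, and the number of iterations is fixed.

```python
def kmeans_1d(points, centers, iterations=10):
    """Group 1-D points around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            # Assign the point to the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(point - centers[i]))
            clusters[nearest].append(point)
        # Recompute each center; an empty cluster keeps its old center.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters
```

No labels are supplied anywhere: the two groups emerge only from the distances between the points, which is what distinguishes this from the supervised case.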

Applications suited to each method

  • Supervised: defining data transformations (Kelvin to Celsius, meters per second to miles per hour, classifying a biological male or female given the number of chromosomes, etc.), predicting weather (given the initial & boundary conditions, plug them into formulas that predict what will happen in the next time step).
  • Unsupervised: forecasting stock markets (through patterns identified in text mining news articles, or sentiment analysis), reducing demographical database data to common features that can easily describe why a certain population will fit a result over another (dimensional reduction), cloud classification dynamical weather models (weather models that use stochastic approximations, Monte Carlo simulations, or probability densities to generate cloud properties per grid point), finally real-time automated conversation translators (either spoken or closed captions).

Most important issues related to each method

Unsupervised machine learning is at the bedrock of big data analysis.  We could use training data (a set of predefined data that is representative of the real data in all its n-dimensions) to fine-tune most unsupervised machine learning efforts and reduce error rates (Barak & Modarres, 2015). What I like most about unsupervised machine learning is its clustering and dimensional reduction capabilities, because it can quickly show me what is important about my big data set without huge amounts of coding and testing on my end.

References: