Central research questions

Qualitative central research questions 

Qualitative central research questions are one or two questions that ask for an exploration of a phenomenon or concept. Creswell (2013) states that, to arrive at these one or two questions, you must ask, “What is the broadest question that I can ask in the study?” Here, we aim to explore the general and even complex factors of our research issue, hoping to draw meaning from the various perspectives within our sample. For each central question, five to seven sub-questions can be asked to help focus the study. We want them narrow enough to give the study a direction, but not so narrow that they leave no room for open questioning. It is from these sub-questions that we derive more specific questions for our interviews with the participants of our study. To develop strong central questions, Creswell (2013) suggests these tips:

  1. Begin the research questions with the words “what” or “how” to convey an open and emerging design.
  2. Focus on a single phenomenon or concept. 
  3. Use exploratory verbs that convey the language of emerging design. 
  4. Use these more exploratory verbs as non-directional rather than directional words that suggest quantitative research, such as “effect”, “influence”, “impact”, “determine”, “cause” and “relate”. 
  5. Use open-ended questions without reference to literature or theory unless otherwise indicated by a qualitative strategy of inquiry. 
  6. Specify the participants and the research site for the study if the information has not yet been given. 

Quantitative research questions 

Qualitative questions are unlike those of quantitative studies, which aim for a specific goal: a narrow question that focuses on a few variables and a hypothesis, which is then used to predict the strength of the relationship between the variables via statistical means.


Creswell, J. W. (2013). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches, 4th Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781483321479/ 

Pros and Cons of Hadoop MapReduce

Some of the advantages and disadvantages of using MapReduce are listed below (Lublinsky et al., 2013; Sakr, 2014).

Advantages:


  • Hadoop is ideal because it is a highly scalable platform that is cost-effective for many businesses.
  • It supports huge computations, particularly in parallel execution.
  • It isolates the programmer from low-level concerns such as fault tolerance, scheduling, and data distribution.
  • It supports parallelism for program execution.
  • It allows easier fault tolerance.
  • It provides a highly scalable redundant array of independent nodes.
  • It runs on cheap, unreliable commodity hardware.
  • Aggregation techniques under the mapper function can exploit multiple different techniques
  • No read or write of intermediate data, thus preserving the input data
  • No need to serialize or de-serialize code in either memory or processing
  • It is scalable based on the size of data and resources needed for processing the data
  • Isolation of the sequential program from data distribution, scheduling, and fault tolerance

Disadvantages:


  • It is not ideal for processing real-time data. During the map phase, the process can create too many keys, which consumes sorting time.
  • Most of the MapReduce outputs are merged.
  • MapReduce cannot use natural indices.
  • In a repartition join, all the records for a particular join key from the input relations must be buffered.
  • Users of the MapReduce framework use textual formats that are inefficient.
  • There is a huge waste of CPU resources, network bandwidth, and I/O since data must be reprocessed and loaded at every iteration.
  • The common framework of MapReduce doesn’t support applications designed for iterative data analysis.
  • Detecting the termination condition, such as reaching a fixed point, may call for an additional MapReduce job, which incurs overhead.
  • The framework of MapReduce doesn’t allow building one task from multiple data sets.
  • Too many mapper functions can create an infrastructure overhead, which increases resources and thus cost 
  • Too few mapper functions can create huge workloads for certain types of computational nodes
  • Too many reducers can provide too many outputs, and too few reducers can provide too few outputs
  • It’s a different programming paradigm that most programmers are not familiar with
  • The use of available parallelism will be underutilized for smaller data sets
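
To make the mapper and reducer vocabulary in the lists above concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases in plain Python. It is not Hadoop's API: the function names (`mapper`, `shuffle`, `reducer`, `word_count`) and the sample input are assumptions made for illustration, and everything runs on one machine rather than across nodes.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input, used only to exercise the sketch.
counts = word_count(["big data is big", "data moves"])
print(counts)
```

In a real cluster each call to `mapper` could run on a different node, which is where the concerns about the number of mappers and reducers listed above come into play.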


  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.

Mobile & Distributed Database Management Systems

A transaction is a set of operations/transformations carried out on a database or relational dataset that moves it from one state to another. Once the transaction is completed and validated as successful, the ending result is saved into the database (Panda et al, 2011). Both ACID and CAP (discussed in further detail below) are known as Integrity Properties for these transactions (Mapanga & Kadebu, 2013).

Mobile Databases

Mobile devices have become prevalent and vital for many transactions when the end-user is unable to access a wired connection. In that case, the device retrieves and saves information on the transaction either over a wireless connection or in disconnected mode (Panda et al, 2011). A problem for mobile users accessing and creating transactions with databases is that bandwidth in a wireless network is not constant: when there is enough bandwidth, access to the end user’s data is rapid, and when there is not, it is slow. A few transaction models can be used efficiently for mobile database transactions: the Report and Co-transactional model; the Kangaroo transaction model; the Two-Tiered transaction model; the Multi-database transaction model; the Pro-motion transaction model; and the Toggle transaction model. This is by no means an exhaustive list of transaction models for mobile databases.

According to Panda et al (2011), in a Report and Co-transactional model, transactions are completed from the bottom up in a nested format, such that a transaction is split between its child and parent transactions. Each child transaction, once completed, feeds its information up the chain until it reaches the parent. However, nothing is committed until the parent transaction is completed. Thus, a transaction can occur on the mobile device but not be fully implemented until it reaches the parent database. In the Kangaroo transaction model, a mobile transaction manager collects and accepts transactions from the end-user and forwards (hops) the transaction request to the database server. Transactions in this model are made by proxy on the mobile device, and when the mobile device moves from one location to the next, a new transaction manager is assigned to produce a new proxy transaction. The Two-Tiered transaction model is inspired by data replication schemes, where there is a master copy of the data and multiple replicas. The replicas are considered to be on the mobile device, but they can make changes to the master copy if the connection to the wireless network is strong enough. If the connection is not strong enough, the changes are made to the replicas, where they show as committed and are still visible to other transactions.

The Multi-database transaction model uses asynchronous schemes to allow a mobile user to unplug from it and still coordinate the transaction. To use this scheme, five queues are set up: input, allocate, active, suspend, and output. Nothing gets committed until all five queues have been completed. Pro-motion transactions come from nested transaction models, where some transactions are completed through fixed hosts and others are done on mobile hosts. When a mobile user is not connected to the fixed host, a command is triggered so that the transaction is completed on the mobile host instead, though carrying out this command is resource-intensive. Finally, the Toggle transaction model relies on software on a pre-determined network and can operate on several database systems. Changes made to the master (global) database can be presented to different mobile systems, and thus concurrency is fixed for all transactions across all databases (Panda et al, 2011).

At a cursory glance, these models seem similar, but they vary strongly in how they implement the ACID properties in their transactions (see Table 1 in the next section).

ACID Properties and their flaws

Jim Gray introduced the idea of ACID transactions in the 1970s. They provide four guarantees: Atomicity (all-or-nothing transactions), Consistency (correct data transactions), Isolation (each transaction is independent of others), and Durability (transactions survive failures) (Mapanga & Kadebu, 2013; Khachana et al, 2011; Connolly & Begg, 2015). ACID is used to assure reliability in a database system whenever a transaction changes the state of the data in the database. This approach is perfect for small relational centralized or distributed databases, but with the demands of mobile transactions, big data, and NoSQL, ACID may be a bit constricting. The web has independent services that are connected relationally but are hard to maintain (Khachana et al, 2011). An example of this is booking a flight for a CTU Doctoral Symposium. One purchases a flight, but may also need another service related to the flight, like ground transportation to and from the hotel. The flight database is completely different and separate from the ground transportation system, yet sites like Kayak.com connect these databases and provide a friendly user interface for their customers. Kayak.com has its own mobile app as well. Taking this example further, we can see how ACID, perfect for centralized databases, may not be the best fit for web-based services. Another case to consider is mobile database transactions: due to their connectivity issues and recovery plans, the aforementioned models cover only some of the ACID properties (Panda et al, 2011). This is the flaw for mobile databases, through the lens of ACID.

Table 1

Mobile Distributed Database Management Systems Transaction Models vs ACID.

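Transaction model | Atomicity | Consistency | Isolation | Durability
Report & Co-transaction model | Yes | Yes | Yes | Yes
Kangaroo transaction model | Maybe | No | No | No
Two-tiered transaction model | No | No | No | No
Multi-database transaction model | No | No | No | No
Pro-motion model | Yes | Yes | Yes | Yes
Toggle transaction model | Yes | Yes | Yes | Yes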

Note: A subset of the information found in Panda et al (2011) dealing with mobile database system transaction models and how they use or do not use the ACID properties.

CAP Properties and their trade-offs

CAP stands for Consistency (as in ACID, all data transactions are correct and all users see the same data), Availability (users always have access to the data), and Partition Tolerance (the database is split over many servers so that no single point of failure exists), and was developed in 2000 by Eric Brewer (Mapanga & Kadebu, 2013; Abadi, 2012; Connolly & Begg, 2015). These three properties are needed for distributed database management systems and are seen as a less strict alternative to Jim Gray’s ACID properties. Unfortunately, you can only build a distributed database system around two of the three properties, giving CA, CP, or AP systems.

CP systems have a reputation for not being available all the time, which is contrary to fact.

Availability in a CP system is given up (or out-prioritized) when Partition Tolerance is needed. Availability in a CA system can be lost if a partition of the data occurs (Mapanga & Kadebu, 2013). Though you can only create a system that is the best in two properties, that doesn’t mean you cannot add the third; the restriction applies only to priority. In a CA system, ACID can be guaranteed alongside Availability (Abadi, 2012). Partitions can vary per distributed database management system due to the WAN, hardware, configured network parameters, levels of redundancy, etc. (Abadi, 2012). Partitions are rare compared to other failure events, but they must be considered. The question remains for all database administrators:

Which of the three CAP properties should be prioritized above all others, particularly in a distributed database management system where partitions must be considered? Abadi (2012) answers this question: for mission-critical data and applications, availability during partitions should not be sacrificed, thus consistency must give way for a while.

Amazon’s Dynamo, Basho’s Riak, Facebook’s Cassandra, Yahoo’s PNUTS, and LinkedIn’s Voldemort are all examples of distributed database systems that can be accessed on a mobile device (Abadi, 2012).

However, according to Abadi (2012), latency (closely related to Availability) is critical to all these systems, so much so that a 100ms delay can significantly reduce an end user’s future retention and repeat transactions. Thus, availability during partitions is key not only for mission-critical systems but also for e-commerce.

Unfortunately, this tradeoff between Consistency and Availability arises due to data replication and depends on how it’s done. 

According to Abadi (2012), there are three ways to do data replication: data updates are sent to all the replicas at the same time (high consistency enforced); data updates are sent to an agreed-upon location first through synchronous and asynchronous schemes (high availability enforced, dependent on the scheme); and data updates are sent to an arbitrary location first through synchronous and asynchronous schemes (high availability enforced, dependent on the scheme). According to Abadi (2012), PNUTS sends data updates to an agreed-upon location first through asynchronous schemes, which improves Availability at the cost of Consistency. Dynamo, Cassandra, and Riak, in contrast, send data updates to an agreed-upon location first through a combination of synchronous and asynchronous schemes.

These three systems propagate data synchronously to a small subset of servers and asynchronously to the rest, which can cause inconsistencies. All of this is done to reduce delay to the end-user.
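
The mix of synchronous and asynchronous propagation described above can be sketched in a few lines. This is a toy model, not how Dynamo, Cassandra, or Riak are implemented: the class `ReplicatedStore`, its parameters `sync_count` and `total`, and the key `seat_12A` are all assumptions made for illustration. A write returns after only the first `sync_count` replicas are updated, so a read from one of the remaining replicas can be stale until `propagate` runs.

```python
class ReplicatedStore:
    """Toy store: some replicas updated synchronously, the rest later."""

    def __init__(self, sync_count, total):
        self.replicas = [{} for _ in range(total)]
        self.sync_count = sync_count
        self.pending = []  # updates not yet applied to the asynchronous replicas

    def write(self, key, value):
        # Synchronous part: the write only returns after these replicas are updated.
        for replica in self.replicas[:self.sync_count]:
            replica[key] = value
        # Asynchronous part: remember the update for later propagation.
        self.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def propagate(self):
        # Stand-in for the background process that brings lagging replicas up to date.
        for key, value in self.pending:
            for replica in self.replicas[self.sync_count:]:
                replica[key] = value
        self.pending.clear()


store = ReplicatedStore(sync_count=2, total=3)
store.write("seat_12A", "booked")
fresh = store.read("seat_12A", 0)      # served by a synchronously updated replica
stale = store.read("seat_12A", 2)      # served by a replica that has not caught up
store.propagate()
converged = store.read("seat_12A", 2)  # same replica after propagation
print(fresh, stale, converged)
```

The window between `write` and `propagate` is where the inconsistency lives: the write is fast because it does not wait for every replica, at the cost of some readers briefly seeing old data.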

Going back to the Kayak.com example from the previous section, consistency in the web environment should be relaxed (Khachana et al, 2011). Expanding further on Kayak.com, if 7 users wanted to access the services at the same time, each could choose which of these properties should or should not be relaxed. One person can order a flight, hotel, and car, and require that none is booked until all services are committed. Another person may be content with whichever car is available for ground transportation as long as they get the flight times and price they want. This can cause inconsistencies, lost information, or misleading information for decision analysis, but systems must be adaptable (Khachana et al, 2011). They must take into account the wireless signal, the mode of transferring and committing data, and load-balancing of incoming requests (who has priority to get a contested plane seat when there is only one left at that price).

At the end of the day, when it comes to CAP, Availability is king. It will drive business away or attract it, thus C or P must give, to cater to the customer. If I were designing this system, I would run an AP system, but conduct the partitioning when the load/demand on the database system is small (off-peak hours), so as to give the illusion of a CA system (because Consistency degradation will be seen by fewer people). Off-peak hours don’t exist for global companies, mobile web services, or websites, but there are times throughout the year when the number of transactions to the database system is smaller than on normal days. So, planning around those days is key. For a mobile transaction system, I would select a pro-motion transaction system that helps comply with ACID properties: make the updates locally on the mobile device when services are not up, and set up a queue of other transactions in order, waiting to be committed once wireless service has been restored or a stronger signal is found.
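
The "update locally, queue in order, commit when the signal returns" idea in the last sentences can be sketched as follows. This is only an illustration of the queuing behavior, not an implementation of the Pro-motion model from Panda et al (2011): the class `MobileClient`, its methods, the `connected` flag, and the use of a plain dict as the server are all assumptions.

```python
from collections import deque


class MobileClient:
    """Toy mobile client that works offline and commits queued updates later."""

    def __init__(self):
        self.local = {}       # local copy, updated immediately
        self.queue = deque()  # updates waiting to be committed, in order

    def update(self, key, value):
        self.local[key] = value
        self.queue.append((key, value))

    def sync(self, server, connected):
        """Commit queued updates in order; do nothing while disconnected."""
        if not connected:
            return 0
        committed = 0
        while self.queue:
            key, value = self.queue.popleft()
            server[key] = value
            committed += 1
        return committed


server = {}  # stand-in for the fixed-host database
client = MobileClient()
client.update("flight", "ATL-DEN")
client.update("car", "compact")
offline_commits = client.sync(server, connected=False)  # no signal yet
online_commits = client.sync(server, connected=True)    # signal restored
print(offline_commits, online_commits, server)
```

A real system would also have to handle conflicts when the server changed in the meantime, which this sketch deliberately leaves out.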


  • Abadi, D. J. (2012). Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer Society, (2), 37-42.
  • Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Upper Saddle River, NJ: Pearson Education, Inc., publishing as Addison-Wesley.
  • Khachana, R. T., James, A., & Iqbal, R. (2011). Relaxation of ACID properties in AuTrA, The adaptive user-defined transaction relaxing approach. Future Generation Computer Systems, 27(1), 58-66.
  • Mapanga, I., & Kadebu, P. (2013). Database Management Systems: A NoSQL Analysis. International Journal of Modern Communication Technologies & Research (IJMCTR), 1, 12-18.
  • Panda, P. K., Swain, S., & Pattnaik, P. K. (2011). Review of some transaction models used in mobile databases. International Journal of Instrumentation, Control & Automation (IJICA), 1(1), 99-104.

Article Review: Knowledge Discovery through Text Analytics

The article “IT innovation adoption by enterprises: Knowledge Discovery through text analytics” was selected as an article in the field of study. The path taken to reach this article is given below:

Science Direct (Elsevier) > Search Term: “Text Analytics”

This article was chosen due to my interest in text analytics of huge data sets, to help derive knowledge from this unstructured data set. This was a primary topic/interest for the dissertation, along with identifying another unconventional way to do literature reviews.


This study investigated two premises: (1) Is it possible to use text data mining techniques to conduct a more thorough and efficient literature review on any subject matter? (2) What are the drivers of IT innovation? The authors identified 472 quality articles spanning multiple fields in business administration and 30 years of knowledge, and then used a tool called Northernlight (http://georgiatech.northernlight.com/) to answer both premises.

The authors state that the current methods of most literature reviews are time-consuming, usually focus on the last five years, and demand a great deal of attention. Most articles are scanned by title and abstract before the researcher considers reading them in their entirety. This method, they argue, is not effective. Instead, applying “meaning extraction” to a large set of documents (usually considered unstructured data sets) across various domains should help the researcher obtain and discover knowledge efficiently. Therefore, the first premise deals with the utilization of text data mining techniques. These techniques shouldn’t merely revolve around counting the identified keyphrases (or “themes”), but around automating “meaning extraction.” “Meaning extraction” measures the strength of the relationship between keywords or phrases. The end-user/researcher can apply rules to enhance meaning extraction between sets of keywords. The authors conclude that these techniques are an excellent way to do a first-pass analysis. The first-pass analysis can help generate more questions, which can lead to more insights in the future.

They support the first premise by applying the Northernlight system to IT innovation. The authors used the 472 articles, in which IT innovation is mentioned in multiple disciplines across the field of business administration. By setting rules to identify the proximity of keywords to other keywords (or their equivalents), they were able to garner some insights into IT innovation. Proximity could be measured within a sentence (~40 words) or a paragraph (~150 words). From an IT department’s perspective, they determined that cost and complexity are the two most frequent IT innovation determinants, along with compatibility and relative advantage. However, at the enterprise level, perceived benefits, perceived usefulness, and ease of use were the determinants of IT innovation. Finally, organization size and top management support correlated positively with IT innovation, while cost correlated negatively.
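
A keyword-proximity rule of the kind described above can be illustrated with a small sketch. This is not the Northernlight algorithm, whose internals the article summary does not give: the function `cooccurrences`, its tokenization, and the sample sentence are assumptions. It simply counts how often two terms appear within `window` words of each other, with the default of 40 words taken from the sentence-level proximity mentioned above.

```python
import re


def cooccurrences(text, term_a, term_b, window=40):
    """Count pairs of (term_a, term_b) occurrences at most `window` words apart."""
    words = re.findall(r"[a-z']+", text.lower())
    positions_a = [i for i, word in enumerate(words) if word == term_a]
    positions_b = [i for i, word in enumerate(words) if word == term_b]
    return sum(1 for i in positions_a for j in positions_b if abs(i - j) <= window)


# Hypothetical text, used only to exercise the sketch.
sample = "Cost slows adoption. Managers say adoption depends on cost and on complexity."
near = cooccurrences(sample, "cost", "adoption", window=5)
adjacent = cooccurrences(sample, "cost", "adoption", window=1)
print(near, adjacent)
```

Widening or narrowing `window` changes how strongly two terms appear to be related, which is why the choice between sentence-level and paragraph-level proximity matters.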


The obstacle here was that some recently published articles with really creative titles came at a price. The article that was chosen was still a good read, but one does wonder how good the papers behind a paywall are. Should research from publicly funded sources be hidden behind a paywall? We paid for it through our taxes, so why should we have to pay for it again once the results are out?


Data Allocation Strategies

Data allocation is how one logical group of data gets spread across a destination data set, e.g. a group of applications that uses multiple servers (Apptio, 2015). According to ETL-Tools (n.d.), different data allocations yield different levels of granularity. Choosing one can be a judgment call, and understanding your allocation strategy is vital for developing and understanding your data models (Apptio, 2015; ETL-Tools, n.d.).

The robustness and accuracy of the model depend on the allocation strategy between data sets, especially because the wrong allocation can create data fallout (Apptio, 2015). Data fallout is where data isn’t assigned between data sets. It is similar to how SQL join statements (left join, right join, etc.) can fail to match every row between two data sets.
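
Data fallout can be made visible with a small sketch. The server names, costs, and the `allocate` helper below are hypothetical and are not taken from Apptio (2015); the point is only that rows with no match in the mapping drop out of the allocation, much as unmatched rows drop out of an inner join.

```python
# Hypothetical source rows and mapping, invented for illustration.
servers = [
    {"server": "s1", "cost": 100},
    {"server": "s2", "cost": 250},
    {"server": "s3", "cost": 75},
]
app_by_server = {"s1": "billing", "s2": "crm"}  # s3 has no application assigned


def allocate(rows, mapping):
    """Split rows into those that can be allocated and those that fall out."""
    matched, fallout = [], []
    for row in rows:
        app = mapping.get(row["server"])
        if app is None:
            fallout.append(row)
        else:
            matched.append({**row, "app": app})
    return matched, fallout


matched, fallout = allocate(servers, app_by_server)
allocated_cost = sum(row["cost"] for row in matched)
fallout_cost = sum(row["cost"] for row in fallout)
print(allocated_cost, fallout_cost)
```

Reporting the fallout total next to the allocated total is one simple way to check whether an allocation strategy is silently losing data.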

ETL-Tools (n.d.) stated that there are dynamic and fixed levels of granularity; however, Apptio (2015) stated that there can be many different levels of granularity. The following are some of the different data allocation strategies (Apptio, 2015; Dhamdhere, 2014; ETL-Tools, n.d.):

  1. Even spread allocation: data allocation where all data points are assigned the same allocation no matter what (e.g. every budget in the household gets the total sum of dollars divided by the number of budgets, even though the mortgage costs more than the utilities). It is the easiest to implement, but it is overly simplified.
  2. Fixed allocation: data allocation based on data that stays constant (e.g. credit card limits). It is easy to implement, but the logic can be risky for data sets that change over time.
  3. Assumption-based allocation (or manually assigned percentages or weights): data allocation based on arbitrary means or an educated approximation (e.g. budgets, but not a breakdown). It uses subject matter experts, but it is only as good as the expertise behind the estimates.
  4. Relationship-based allocation: data allocation based on the association between items (e.g. hurricane max wind-speeds and hurricane minimum central pressure). This can be easily understood; however, some nuance can be lost. In the given example, there can be a lag between hurricane max wind-speeds and hurricane minimum central pressure, so the correlation is high but errors remain.
  5. Dynamic allocation: data allocation based on data that can change off of a calculated field (e.g. tornado wind-speed to the Enhanced Fujita scale). It is easily understood, but it is still an approximation, albeit at a higher level of fidelity than the lower levels of allocation.
  6. Attribute-based allocation: data allocation weighted by a static attribute of an item (e.g. corporate cell phone costs and data usage by service provider like AT&T, Verizon, T-Mobile; direct spend weighting of shared expenses). It reflects real-life data usage, but lacks granularity when you want to drill down to find the root cause.
  7. Consumption-based allocation: data allocation by measured consumption (e.g. checkbook line items, general ledgers, activity-based costing). It needs huge data sets and gives greater fidelity, but it must be updated frequently.
  8. Multi-dimensional allocation: data allocation based on multiple factors. It could be the most accurate level of allocation for complex systems, but it can be hard to understand intuitively and is therefore not as transparent as a consumption-based allocation.

The higher the number, the more mature the allocation and the higher the level of granularity of the data. Sometimes it is best to start at a level 1 maturity and work our way up to a level 8. Dhamdhere (2014) suggests that consumption-based allocation (e.g. activity-based costing) is a best practice when it comes to allocation strategies, given its focus on accuracy. However, some levels of maturity may not be acceptable in certain cases (ETL-Tools, n.d.). Please take into consideration what the best allocation strategy is for yourself, for the task before you, and for the expectations of the stakeholders.
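
To show how much the choice of strategy changes the result, here is a minimal sketch comparing level 1 (even spread) with level 7 (consumption-based) for the same shared cost. The department names, usage figures, and the 900.0 total are invented for illustration, and the helper names `even_spread` and `consumption_based` are assumptions rather than terms from the cited sources.

```python
def even_spread(total, items):
    """Level 1: every item gets the same share, regardless of usage."""
    share = total / len(items)
    return {item: share for item in items}


def consumption_based(total, usage):
    """Level 7: each item's share is proportional to its measured consumption."""
    total_usage = sum(usage.values())
    return {item: total * used / total_usage for item, used in usage.items()}


# Hypothetical measured data usage (GB) per department for a 900.0 shared bill.
usage_gb = {"sales": 60, "support": 30, "finance": 10}

even = even_spread(900.0, list(usage_gb))
weighted = consumption_based(900.0, usage_gb)
print(even)
print(weighted)
```

Under the even spread, finance pays the same as sales despite using a sixth of the data; the consumption-based split fixes that, but only because the usage measurements exist and are kept current.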


Foul on the X-axis and more

There are multiple ways to use data to justify any story or agenda one has. My p-hacking post shows how statistics have been used to get statistically significant results so that work can get published; and because journals and editors do not glorify replication studies, it can be hard to fund them. However, there are also ways to manipulate graphs to fit any narrative you want. Take the figure below, which was published on the Georgia Department of Public Health website on May 10, 2020. Notice something funny going on in the x-axis: it looks like a Doctor Who voyage across time trying to solve the coronavirus crisis. The dates on the x-axis are not in chronological order (Bump, 2020; Fowler, 2020; Mariano & Trubey, 2020; McFall-Johnsen, 2020; Wallace, 2020). The dates are in the order they need to be to make it appear that the number of coronavirus cases in Georgia’s top 5 impacted counties is decreasing over time.

Figure 1: May 10 top five impacted counties bar chart from the Georgia Department of Public Health website.

The figure above, if the dates were lined up appropriately would tell a different story. Once this chart was made public, it garnered tons of media coverage and was later fixed. But, this happens all the time when people have an agenda. They mess with the axis, to give them the result they want. It is really rare though to see a real-life example of it on the x-axis.

But wait, there’s more! Notice the grouping order of the top five impacted counties. Pick a color: it looks like the Covid-19 counts per county are playing musical chairs. What was done here is that, for each day, the top five counties were ordered by descending count, which makes the chart even harder to understand and interpret, again sowing a narrative that may not be accurate (Bump, 2020; Fowler, 2020; Mariano & Trubey, 2020; McFall-Johnsen, 2020; Wallace, 2020).

Now, according to Fowler (2020), there are issues in how the number of Covid-19 cases gets counted here, which adds to misinformation and sows further distrust. This is just another way to build the narrative you wish you had: by carving out an explicit definition of what is in and what is out, you can cause an artificial skew in your data, again to favor a narrative or produce false results that could be accidentally generalized. Here Fowler explains:

“When a new positive case is reported, Georgia assigns that date retroactively to the first sign of symptoms a patient had – or when the test was performed, or when the results were completed. “

Understanding that the virus had many asymptomatic carriers that never got reported is also part of the issue. Because you could be asymptomatic for days and still have Covid-19 in your system, the definition above is inaccurate. Fowler also explains that the backlog of Covid-19 tests is such that it could take days to report a positive case. As a result, the numbers reported for the last 14 days shift wildly with each iteration of the graph. So, even once Figure 1 was fixed, the last 14 days inherently show a decrease in cases due to the backlog, the definition, and the understanding of the virus (see Figure 2).

Figure 2: May 19 top five impacted counties bar chart from the Georgia Department of Public Health website.

They did fix the ordering of the counties and the x-axis, but only after it was reported by Fox News, the Washington Post, and Business Insider, to name a few. However, the definition of what counts as a Covid-19 case distorts the numbers and still tells the wrong story. It is easy to see this effect when you compare the May 4-9 data between Figure 1 and Figure 2. Figure 2 has a higher incidence of Covid-19 recorded over that same period. That is why definitions and criteria matter just as much as how graphs can be manipulated.

Mariano & Trubey (2020) do have a point: some errors are expected during a time of chaos, but common craftsmanship should still be observed. Be careful of how data is collected and how it is represented on graphs, and look not only at the commonly manipulated y-axis but also at the x-axis. That is why the methodology sections in peer-reviewed work are extremely important.
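
Both ordering problems discussed in this post (dates out of chronological order, and counties re-sorted within each day) can be guarded against before anything is plotted. The sketch below uses made-up counts that are not the Georgia figures; the row layout and variable names are assumptions. It sorts by date first and then by one fixed county order, so every day's bars appear in the same sequence.

```python
from datetime import date

# Made-up (date, county, count) rows in a deliberately scrambled order.
rows = [
    (date(2020, 5, 2), "Fulton", 50),
    (date(2020, 4, 28), "Cobb", 40),
    (date(2020, 5, 2), "Cobb", 45),
    (date(2020, 4, 28), "Fulton", 60),
]

# One fixed county order for every day, instead of re-ranking counties per day.
county_order = sorted({county for _, county, _ in rows})

# Chronological x-axis first, then the fixed county order within each date.
ordered = sorted(rows, key=lambda row: (row[0], county_order.index(row[1])))
x_labels = [day.isoformat() for day in sorted({day for day, _, _ in rows})]

for day, county, count in ordered:
    print(day.isoformat(), county, count)
```

Sorting real `date` objects rather than date strings is the design choice that matters here, since text labels such as "May 2" and "Apr 28" do not sort chronologically on their own.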


Parallel Programming: Compelling Topics

(0) A thread is a unit (or sequence of code) that can be executed by a scheduler, essentially a task (Sanden, 2011). A single thread (task) has one program counter and one sequence of code. Multi-threading occurs when several threads, each with its own program counter, share common code. The sequences of code in a multi-threaded program can thus be assigned to different processors to run in parallel (simultaneously) to speed up a task. Another way to multi-thread is to execute the same code on different processors with different inputs. If data is shared between the threads, there is a need for a “safe” object with synchronization, where only one thread at a time can access the data stored in the “safe” object. It is through these “safe” objects that a thread can communicate with another thread.

(1) Sanden (2011) shows how to use synchronized objects (concurrency in Java): “safe” objects that are protected by locks in critical synchronized methods.  In Java we can create threads by (1) extending the class Thread or (2) implementing the interface Runnable.  The latter defines the code of a thread in a method, void run(), and the thread completes its execution when it reaches the end of the method (which is essentially like a subroutine in FORTRAN).  Using the former, you need the constructors public Thread() and public Thread(Runnable runObject), along with methods like public void start().

(2) Shared objects that force mutual exclusion on the threads that try to call them are “safe objects”.  The mutual exclusion on threads/operations can be relaxed when threads don’t change any data, for example when they only read the data in the “safe object” (Sanden, 2011).
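
The “safe” object idea from points (0) through (2) can be sketched in Python rather than Java, as an assumption made to keep the example short. Here `threading.Lock` plays the role that synchronized methods play in Java, and `threading.Thread` with a `target` function is analogous to passing a Runnable. The class `SafeCounter` and the function `worker` are hypothetical names.

```python
import threading


class SafeCounter:
    """A 'safe' object: only one thread at a time may touch the shared value."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:  # mutual exclusion around the read-modify-write
            self._value += 1

    def value(self):
        with self._lock:
            return self._value


def worker(counter, times):
    """Code run by each thread, comparable to the body of run() in Java."""
    for _ in range(times):
        counter.increment()


counter = SafeCounter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

total = counter.value()
print(total)
```

Because every increment happens while holding the lock, the four threads cannot interleave their updates, and no increments are lost.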

(3) Deadlock occurs when a thread tries to acquire an additional resource while holding one or more others, especially when this creates a circularity. To prevent deadlocks, resources need to be controlled.  One should draw a wait-chain diagram to make sure the design helps prevent deadlock, especially when a mix of transactions is occurring.  A good example of a deadlock is a stalemate in chess or, as Stacy said, a circular firing squad.
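
One common way to control resources so that no circular wait can form is to have every thread acquire locks in the same global order. The sketch below illustrates that technique; it is an assumption on my part that it fits the design discussed above, since Sanden (2011) is cited here only for the wait-chain diagram. The names `lock_a`, `lock_b`, and `use_both` are hypothetical.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
completed = []  # names of the threads that finished; appended while holding both locks


def use_both(name, first, second):
    """Acquire two locks in one global order, whatever order the caller asked for."""
    low, high = sorted((first, second), key=id)  # id() gives an arbitrary but consistent order
    with low:
        with high:
            completed.append(name)


# The two threads request the locks in opposite orders, the classic deadlock setup.
t1 = threading.Thread(target=use_both, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=use_both, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)

print(sorted(completed))
```

Without the `sorted` call, `t1` could hold `lock_a` while waiting for `lock_b` and `t2` could hold `lock_b` while waiting for `lock_a`, which is exactly the circularity described above.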

(4) In a distributed system, nodes can talk to (cooperate with) each other and coordinate their systems.  However, the different nodes can execute concurrently, there is no global clock on which all nodes run, and some of these nodes can fail independently.  Since nodes talk to each other, we must study them as they interact with each other.  Thus, we need logical clocks (because we don’t have a global clock), in which distances in time are lost. With logical clocks, all nodes agree on a partial order of events (where one event can happen before another).  They only describe the order of events, not when the events happened.  If nodes are completely disjoint in a logical clock, then a node can fail independently. (This was my favorite subject because I can now visualize more about what I was reading and the complex nature of nodes.)
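
One well-known kind of logical clock is the Lamport clock, sketched below as an illustration; the text above does not say which logical clock scheme Sanden (2011) uses, so treating it as a Lamport clock is an assumption. Each node keeps a counter, increments it on every local event, and on receiving a message jumps ahead of the sender's timestamp. The class and method names are hypothetical.

```python
class LamportClock:
    """Per-node counter that orders events without any global clock."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event happened on this node."""
        self.time += 1
        return self.time

    def send(self):
        """Sending a message is an event; the returned timestamp travels with it."""
        return self.tick()

    def receive(self, sent_time):
        """On receipt, move past both our own time and the sender's timestamp."""
        self.time = max(self.time, sent_time) + 1
        return self.time


node_a, node_b = LamportClock(), LamportClock()
node_a.tick()                     # local event on A
stamp = node_a.send()             # A sends a message to B
node_b.tick()                     # unrelated local event on B
received = node_b.receive(stamp)  # B receives the message from A
print(stamp, received)
```

The timestamps only guarantee that the send is ordered before the receive; the numbers say nothing about how much real time passed, which is the sense in which distances in time are lost.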

(5) An event thread is a totally ordered sequence of event occurrences, where a control thread processes each occurrence in turn.  In the event thread, two occurrences x and y can be related in one of three ways:

  • x → y (x happens before y)
  • y → x (y happens before x)
  • x || y (x and y are concurrent)

Events in this thread must be essential to the situation they are being used for and independent of any software design.  Essential threads can be shared, for example by time, by domain, or by software, while others are not shared, as they occur inside the software.


Parallel Programming: State Diagram Example


  • Enter into state S0 into the superstate S1 through event 1 and yields action a1.
  • When entering into superstate S1, we must go through state S12, with action a7 to enter and action a3 to exit.
    • If action a3 yields an event e9, which yielded action a9, we enter into state S13, causing action a6 and action a12 to exit.
      • If action a12 yields an event e5, we will get action a5 and we hit the superstate S1 and begin again to state S2.
      • If action a12 yields an event e9, we will use action a1 and enter state S112 (under the S11 superstate) with an entry action a11.
        • Event e2 acts on S112 to get action a2, which enters the superstate S11.
          • Entering into the superstate through state S112 we get an exit criterion of action a14 and we end.
          • If exiting state S112 we do event e1 and action a1 we are sent back to state S12 to start again.
          • If we exit state S112 we do event e3 and action a3 which is used to enter into state S1 follow 1.a.
    • If action a3 in state S12 yields event e4 and action a4, we enter the superstate S11. Entering superstate S11 this way, we enter state S111 with an entry action of a8.
      • We then carry out event e9 and action a1 to get to state S112. If this happens follow 1.a.i.2.

Parallel Processing: Ada Tasking and State Modeling

Sample Code

1  protected Queue is
2          procedure Enqueue (elt : in Q_Element);     
3                                            -- Insert elt @ the tail end of the queue
4          function First_in_line return Q_Element;
5                                            -- Return the element at the head of the
6                                            -- queue, or no_elt.
7          procedure Dequeue;                  
8                                            -- If the queue is not empty, remove the
9                                            -- element at its head
10 end Queue;
12 task type Worker;
14 task body Worker is
15   elt : Q_Element;                        -- Element retrieved from queue
16   begin
17      while true loop
18           elt := Queue.First_in_line;     -- Get element at head of queue
19           if elt = no_elt then            -- Let the task loop until there
20              delay 0.1;                   -- is something in the queue
21           else
22               Process (elt);              -- Process elt. This takes some time
23               Queue.Dequeue;              -- Remove element from queue
24           end if;
25     end loop;
26 end Worker;

Comparing and Contrasting Ada’s Protected Function to Java’s (non)synchronized

Java’s safe objects are synchronized objects, whose methods are either “synchronized” or “non-synchronized”.  The non-synchronized methods are mostly expected to be read-only with respect to the safe object.  Synchronized methods, in contrast, can both write and read the safe object, and they usually have a wait loop at the beginning with a certain wait condition (i.e. while (condition) {wait( );}).  This wait loop forces the thread to wait for as long as the condition holds, while the lock held by the synchronized method prevents multiple threads from editing the same safe object at the same time.  Wait loops located elsewhere in a Java synchronized method are uncommon.
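
The wait-loop pattern can be sketched as a one-slot mailbox. The Mailbox class and the exchange helper are hypothetical names used only for illustration. Both synchronized methods begin with while (condition) { wait ( ); }, so a thread proceeds only once its condition no longer holds.

```java
import java.util.ArrayList;
import java.util.List;

class WaitLoopDemo {
    // A safe object holding at most one message; null means the slot is empty.
    static class Mailbox {
        private String message;

        synchronized void put(String m) throws InterruptedException {
            while (message != null) { wait(); } // wait while the slot is full
            message = m;
            notifyAll();                        // wake any thread waiting in take()
        }

        synchronized String take() throws InterruptedException {
            while (message == null) { wait(); } // wait while the slot is empty
            String m = message;
            message = null;
            notifyAll();                        // wake any thread waiting in put()
            return m;
        }
    }

    // One producer thread puts every message; the calling thread takes them all.
    static List<String> exchange(List<String> toSend) {
        Mailbox box = new Mailbox();
        List<String> received = new ArrayList<>();
        Thread producer = new Thread(() -> {
            try {
                for (String s : toSend) box.put(s);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            for (int i = 0; i < toSend.size(); i++) received.add(box.take());
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(exchange(java.util.Arrays.asList("a", "b", "c"))); // prints [a, b, c]
    }
}
```

The condition is rechecked in a loop rather than with a single if because a woken thread has to reacquire the lock, and the state may have changed again by the time it does.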

Safe objects in Ada are protected objects, whose operations are declared as a “protected procedure” or a “protected function”.  Unlike the non-synchronized method in Java (where it should be read-only only by convention), the protected function in Ada is enforced as read-only.  Where Java’s synchronized version uses a wait loop that stalls a thread, Ada’s protected entry (i.e. entry x ( ) when condition is …) is only entered when the condition is true, thus you can have multiple entries in which the data can be manipulated in different ways, similar to an if-else statement.  For example, one entry could be entry x ( ) when condition is true and another one right after could be entry x ( ) when condition is false.  This can be expanded to n different conditions.  With these entries, the barrier is always tested first, in contrast to wait.  However, we could requeue on another entry (a requeue is not a subprogram call, so the thread doesn’t return to the point after the requeue call), but it’s uncommon to have requeues located elsewhere in the program.

Describe what worker does

A Worker first gets the element (elt) at the head of the queue, and it keeps looping until there is an element in the queue that it can process.  The element is stored in the local variable elt.  If there is no element, the Worker delays for 0.1 seconds and keeps looping.  Once the Worker has obtained an element, it processes the element and then removes it from the queue.

Adapt Queue and Worker so that multiple instances of Worker can process data elements

For each element to be processed by only one worker, the first change is from “task body Worker is” to “protected body Worker is” on line 14.  Then change “elt : Q_Element;” to “procedure get (elt : Q_Element) is” on line 15, in order to get the element from the queue.

Once there is an element first in line in the queue, the worker must dequeue it before processing it.  This should protect the data and allow another worker to work on the next first-in-line element.  Thus, I propose swapping lines 22 and 23.  If this isn’t preferred, we can create a new array called work_in_progress with get, put, and remove procedures, whose calls should go before line 22, and then follow my proposed sequence.  This allows the worker to say: I got this element and I will work on it, without holding up other workers from working on other elements.  If all is successful, we delete the element from the work_in_progress array and don’t need to re-add it to the queue.  However, if the worker reports that it failed to process the element, the element is returned to the queue for another worker to process.  To avoid an endless loop, if an element cannot be processed by three different workers, we can create another array to store non-compliant elements in and call Dequeue on the main queue.  However, we can simply swap lines 22 and 23, and do nothing more, if and only if processing the element can never fail.

In Queue, line 2 must become an entry, for instance “entry Enqueue (elt : in Q_Element) when count >= 0 is … end Enqueue”, to allow for waiting until there are actually elements to be added.  Using entries in Queue would eliminate the need for the while true loop to check whether there is an elt first in line, in lines 19, 20, & 21.  Thus, our conditions are checked first rather than later on in the code.  Similarly, we can change line 7 to “entry Dequeue (elt : in Q_Element) when count > 0 is … end Dequeue”, to allow for waiting until there is actually an element to delete.  Though this is more of an efficiency issue, it allows the worker to say I got this element and it’s ok to delete it from the queue.  With all these changes, we must make sure that on line 18 we are pulling an element from the queue.

The loop where instances of workers wait is crude

Lines 18 and 23 can be combined into a single Queue.Pop (elt) call on line 18, above the if-statement, to avoid the crude loop where worker threads wait for something in the queue.  The pop allows for no “busy waiting”.  But we must create a procedure in Queue called Pop, which returns the first item in line in the queue and removes it.
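
For comparison, here is a rough Java analogue of the Pop idea; it is a sketch, not the Ada solution. WorkQueue, pop, and processAll are hypothetical names, and the close method is an added assumption so that the workers can terminate. The pop blocks without busy waiting and removes the head element in the same step, so no two workers can ever obtain the same element.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BlockingPopDemo {
    static class WorkQueue {
        private final ArrayDeque<Integer> items = new ArrayDeque<>();
        private boolean closed = false; // assumption: lets workers stop cleanly

        synchronized void enqueue(int elt) { items.addLast(elt); notifyAll(); }

        synchronized void close() { closed = true; notifyAll(); }

        // Blocks until an element exists, then returns and removes it in one step.
        // Returns null once the queue is closed and drained.
        synchronized Integer pop() throws InterruptedException {
            while (items.isEmpty() && !closed) { wait(); }
            return items.pollFirst();
        }
    }

    // Several workers drain one queue; returns the processed elements, sorted.
    static List<Integer> processAll(int elements, int workerCount) {
        WorkQueue queue = new WorkQueue();
        List<Integer> processed = Collections.synchronizedList(new ArrayList<>());
        Thread[] workers = new Thread[workerCount];
        for (int w = 0; w < workerCount; w++) {
            workers[w] = new Thread(() -> {
                try {
                    Integer elt;
                    while ((elt = queue.pop()) != null) {
                        processed.add(elt); // stand-in for Process (elt)
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[w].start();
        }
        for (int i = 0; i < elements; i++) queue.enqueue(i);
        queue.close();
        for (Thread t : workers) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        List<Integer> sorted = new ArrayList<>(processed);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(processAll(10, 3)); // prints [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

Because the element leaves the queue before it is processed, a worker that fails mid-processing would lose it; handling that case needs something like the work_in_progress bookkeeping described earlier.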


The context with an image

Sometimes as a data scientist or regular scientist, we produce beautiful charts that are chock-full of meaning and data; however, to those in the outside world, they can be misconstrued.  To avoid the misreading of your graphs on a dashboard, paper, or even a blog post, sometimes context is needed.  The amount of context needed will depend on the complexity of the graphic and the severity of misinterpretation.  The higher the complexity, the more contextual text is needed to help the reader digest the information you are presenting.  The higher the severity of misinterpretation, e.g. life-threatening if misread or a loss of millions of dollars, the more contextual text should be included.

Contextual text can help a reader understand your tables, graphs, or dashboards, but not every instance requires the same level of detail.  The following are mere examples of what light, medium, and heavy context could include:

Light Context (bullet points)

  • Source system or source details
  • Details on allocations impacting objects in a model
  • Details on data joins or algorithms used
  • Data nuances (excludes region x)

Medium Context (Calling out use cases)

  • A succinct explanation of what the user should be able to get out of the report/reporting area, graph, table, or dashboard

Heavy Context (Paragraph Explanations)

  • The best example is the results section of a scientific peer-reviewed journal, which not only has a figure description, but they go into detail about areas to pay attention to, outliers, etc.

Below is an example from the National Hurricane Center’s (NHC) 5-day forecast cone for Tropical Storm Sebastien.  Notice the note on the graphic:

“Note: The cone contains the probable path of the storm center but does not show the size of the storm.  Hazardous conditions can occur outside of the cone.” (NHC, 2019).

This line alone falls under light context until you add the key below, which is a succinct explanation of how to read the graphic, making the whole graphic fall under medium context.


A second image, originally produced by the NHC for Hurricane Beryl in 2016, shows an example of heavy context: the text below the NHC image, which is added by an app.  In the application from which this image is pulled, the block of text states the following (Appadvice.com, n.d.):

“This graphic shows an approximate representation of coastal areas under a hurricane warning (red), hurricane watch (pink), tropical storm warning (blue) and tropical storm watch (yellow). The orange circle indicates the current position of the center of the tropical cyclone. The black line, when selected, and dots show the National Hurricane Center (NHC) forecast track of the center at the times indicated. The dot indicating the forecast center location will be black if the cyclone forecast to be tropical and will be white with a black outline if the cycle is …”