Data Allocation Strategies

Data allocation is how one logical group of data gets spread across a destination data set, e.g., a group of applications that uses multiple servers (Apptio, 2015). According to ETL-Tools (n.d.), different allocations yield different levels of granularity. Choosing among them can be a judgment call, and understanding your allocation strategy is vital for developing and understanding your data models (Apptio, 2015; ETL-Tools, n.d.).

The robustness and accuracy of the model depend on the allocation strategy between data sets, especially because the wrong allocation can create data fallout (Apptio, 2015). Data fallout occurs when data is left unassigned between data sets, much as a SQL join (LEFT JOIN, RIGHT JOIN, etc.) can fail to match every row between two tables, as sketched below.
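
As a minimal sketch of that idea, the hypothetical pandas example below joins a made-up application-cost table to a server-mapping table; the invented "Legacy" row has no match, so it falls out of the allocation. The table names and values are assumptions for illustration only.

```python
# Minimal sketch of data fallout, assuming two hypothetical tables:
# application costs and the servers they should be allocated to.
import pandas as pd

costs = pd.DataFrame({
    "app": ["CRM", "ERP", "Email", "Legacy"],
    "cost": [1000, 2500, 400, 750],
})

servers = pd.DataFrame({
    "app": ["CRM", "ERP", "Email"],   # "Legacy" has no server mapping
    "server": ["srv-01", "srv-02", "srv-03"],
})

# A left join keeps every cost row, but "Legacy" picks up no server --
# that unmatched row is the data fallout described above.
allocated = costs.merge(servers, on="app", how="left")
fallout = allocated[allocated["server"].isna()]
print(fallout)  # the cost that was never allocated to any server
```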

ETL-Tools (n.d.) states that granularity can be either dynamic or fixed, whereas Apptio (2015) describes many different levels of granularity. The following are some of the common data allocation strategies (Apptio, 2015; Dhamdhere, 2014; ETL-Tools, n.d.); a short code sketch contrasting two of them follows the list:

  1. Even spread allocation: every item receives the same share regardless of any other factor (e.g., every household budget category gets the total dollars divided by the number of categories, even though the mortgage costs more than the utilities). It is the easiest to implement, but it is overly simplistic.
  2. Fixed allocation: allocation based on values that stay constant over time (e.g., credit card limits). Easy to implement, but the logic becomes risky for data sets that do change over time.
  3. Assumption-based allocation (or manually assigned percentages or weights): allocation based on educated approximations (e.g., an overall budget without a detailed breakdown). It draws on subject matter experts, but it is only as good as the expertise behind the estimates.
  4. Relationship-based allocation: allocation based on the association between items (e.g., hurricane maximum wind speeds and minimum central pressure). This is easy to understand, but some nuance can be lost; in the example, there can be a lag between maximum wind speed and minimum central pressure, so even a high correlation still carries errors.
  5. Dynamic allocation: allocation driven by a calculated field whose value can change (e.g., mapping tornado wind speed to the Enhanced Fujita scale). Easily understood, but it is still an approximation, albeit at a higher fidelity than the lower-level strategies.
  6. Attribute-based allocation: allocation weighted by a static attribute of an item (e.g., corporate cell phone costs weighted by data usage per service provider such as AT&T, Verizon, or T-Mobile; direct-spend weighting of shared expenses). It reflects real-life usage but lacks granularity when you want to drill down to a root cause.
  7. Consumption-based allocation: allocation by measured consumption (e.g., checkbook line items, general ledgers, activity-based costing). It offers greater fidelity, but it needs large data sets and frequent updates.
  8. Multi-dimensional allocation: allocation based on multiple factors at once. It can be the most accurate strategy for complex systems, but it is harder to grasp intuitively and therefore less transparent than consumption-based allocation.
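
To make the contrast concrete, here is a minimal sketch (with made-up household numbers) of strategy 1 versus strategy 6: an even spread divides a shared cost equally, while an attribute-based allocation weights it by a static attribute such as usage. The cost and attribute values are assumptions for illustration only.

```python
# Hypothetical example: allocate a $100 shared cost across three items,
# first with an even spread, then weighted by a static usage attribute.
shared_cost = 100.0
usage = {"mortgage": 1500, "utilities": 300, "groceries": 700}  # made-up attribute values

# Strategy 1: even spread -- every item gets the same share.
even = {item: shared_cost / len(usage) for item in usage}

# Strategy 6: attribute-based -- each share is proportional to its attribute.
total_usage = sum(usage.values())
weighted = {item: shared_cost * value / total_usage for item, value in usage.items()}

print(even)      # {'mortgage': 33.3..., 'utilities': 33.3..., 'groceries': 33.3...}
print(weighted)  # {'mortgage': 60.0, 'utilities': 12.0, 'groceries': 28.0}
```

The even spread hides the fact that the mortgage drives most of the spend; the attribute-weighted version recovers that signal at the cost of needing a trustworthy attribute.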

The higher the number, the more mature the strategy and the finer the granularity of the data. Sometimes it is best to start at level 1 and work your way up toward level 8. Dhamdhere (2014) suggests that consumption-based allocation (e.g., activity-based costing) is a best practice among allocation strategies because of its focus on accuracy. However, some maturity levels may not be acceptable in certain cases (ETL-Tools, n.d.). Consider which allocation strategy best fits you, the task before you, and the expectations of the stakeholders.

Resources:

Foul on the X-axis and more

There are multiple ways to use data to justify any story or agenda. My p-hacking post showed how statistics can be tortured into statistically significant results so that work gets published, and with journals and editors rarely celebrating replication studies, such studies can be hard to fund. But there are also ways to manipulate graphs to fit any narrative you want. Take the figure below, published on the Georgia Department of Public Health website on May 10, 2020. Notice something funny going on with the x-axis: it looks like Doctor Who's voyage across time trying to solve the coronavirus crisis. The dates on the x-axis are not in chronological order (Bump, 2020; Fowler, 2020; Mariano & Trubey, 2020; McFall-Johnsen, 2020; Wallace, 2020). Instead, the dates are ordered so that the number of coronavirus cases in Georgia's five most-impacted counties appears to decrease over time.

Figure 1: May 10 top five impacted counties bar chart from the Georgia Department of Public Health website.

If the dates in the figure above were lined up chronologically, it would tell a different story. Once the chart was made public, it garnered a great deal of media coverage and was later fixed. But this happens all the time when people have an agenda: they tamper with an axis to get the result they want. It is rare, though, to see a real-life example of it on the x-axis.

But wait, there’s more! Notice the grouping order of the top five impacted counties. Pick a color and follow it: the Covid-19 counts per county look like they are playing musical chairs. Within each day, the counties were re-sorted in descending count order, which makes the chart even harder to read and interpret, again sowing a narrative that may not be accurate (Bump, 2020; Fowler, 2020; Mariano & Trubey, 2020; McFall-Johnsen, 2020; Wallace, 2020). The sketch below shows how much ordering alone can change the apparent trend.
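
As a rough sketch of that effect, the example below plots the same invented daily counts twice: once in chronological order and once sorted by descending count, the way the original chart effectively was. The dates and counts are made up; only the reordering is the point.

```python
# Sketch with invented daily case counts: the data are identical in both
# panels; only the ordering of the x-axis changes the apparent trend.
import matplotlib.pyplot as plt

dates = ["Apr 26", "Apr 27", "Apr 28", "Apr 29", "Apr 30", "May 01", "May 02"]
cases = [120, 95, 140, 110, 150, 105, 130]  # made-up counts, no real trend

# Re-order the same points from highest to lowest count.
by_count = sorted(zip(dates, cases), key=lambda pair: pair[1], reverse=True)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.bar(dates, cases)
ax1.set_title("Chronological order (honest)")
ax2.bar([d for d, _ in by_count], [c for _, c in by_count])
ax2.set_title("Sorted by count (looks like a decline)")
plt.tight_layout()
plt.show()
```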

According to Fowler (2020), there are also issues in how the number of Covid-19 cases gets counted, which adds to the misinformation and sows further distrust. It is another way to build the narrative you wish you had: by carving out an explicit definition of what is counted and what is not, you can artificially skew your data, again favoring a narrative or producing false results that might be accidentally generalized. Fowler explains:

“When a new positive case is reported, Georgia assigns that date retroactively to the first sign of symptoms a patient had – or when the test was performed, or when the results were completed.”

Part of the issue is that the virus had many asymptomatic carriers who never got reported; since you could be asymptomatic for days and still have Covid-19 in your system, the definition above is bound to be inaccurate. Fowler also explains that the testing backlog was so large that it could take days to report a positive case, so under this definition the counts for the most recent 14 days shift wildly with each iteration of the graph. As a result, even after Figure 1 was fixed, the last 14 days inherently show a decrease in cases due to the backlog, the definition, and our evolving understanding of the virus (see Figure 2); the sketch below illustrates why.
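
The rough simulation below, built entirely on invented numbers and an assumed 0-10 day reporting lag, shows why backdating cases to onset dates makes the most recent days look like a decline even when the true incidence is flat.

```python
# Invented illustration of the backdating effect described above: cases are
# reported with a lag but counted on their (earlier) onset date, so the most
# recent days of the chart are always undercounted.
import random
from collections import Counter

random.seed(0)
today = 30                      # day index of "today"; onsets occur on days 0..29
onset_counts = Counter()

for onset_day in range(today):
    true_cases = 100            # flat true incidence -- no real decline
    for _ in range(true_cases):
        report_delay = random.randint(0, 10)   # assumed 0-10 day reporting lag
        if onset_day + report_delay <= today:  # only already-reported cases are visible
            onset_counts[onset_day] += 1

# The last ~10 days look like a drop even though true incidence is constant.
for day in range(today - 14, today):
    print(day, onset_counts[day])
```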

Figure 2: May 19 top five impacted counties bar chart from the Georgia Department of Public Health website.

The department did fix the ordering of the counties and the x-axis, but only after the chart was reported on by Fox News, the Washington Post, and Business Insider, to name a few. However, the definition of what counts as a Covid-19 case still distorts the numbers and tells the wrong story. The effect is easy to see when you compare the May 4-9 data between Figure 1 and Figure 2: Figure 2 records a higher incidence of Covid-19 over that same period. That is why definitions and criteria matter just as much as how graphs can be manipulated.

Mariano & Trubey (2020) do have a point: some errors are expected during a time of chaos, but basic good stewardship of the data should still be observed. Be careful about how data is collected and how it is represented on graphs, and scrutinize not only the commonly manipulated y-axis but also the x-axis. That is why the methodology sections of peer-reviewed work are so important.

Resources: