Adv DBs: Data warehouses

Data warehouses allow people with decision-making power to quickly locate the right data in one location that spans multiple functional departments and is well integrated, so they can produce reports and in-depth analysis and make effective decisions (MUSE, 2015). The Corporate Information Factory (CIF) and the Business Dimensional Lifecycle (BDL) aim at the same goal but are applied to different situations, each with its own pros and cons (Connolly & Begg, 2015).

Corporate Information Factory:

The CIF approach builds consistent and comprehensive business data in a data warehouse to meet the needs of the business and its decision makers.  It typically uses traditional databases to create a data model of all of the data in the entire company before that model is implemented in a data warehouse.  From the data warehouse, departments can then create data marts (subsets of the data warehouse data) to meet their own needs.  This approach is favored when we need data for decision making today rather than a few weeks or up to a year out, once the system is set up: you can see all the data you wish and work with it in this environment.  However, the disadvantage follows from that same point; seeing and working with all that data, with no need to wait weeks, months, or years, requires a large, complex data warehouse.  A warehouse that houses all the data you would ever need and more is expensive and time-consuming to set up.  Infrastructure costs are high in the beginning, with only variable costs in the years that follow (maintenance, growing data structures, adding new data streams, etc.) (Connolly & Begg, 2015).

This seems like the approach a newer company, like Twitter, would take.  Knowing that in the future it could do really powerful business intelligence analysis on its data, it may have made an upfront investment in its architecture and development-team resources to build a more robust system.

Business Dimensional Lifecycle:

In this view, all data needs are evaluated first to create the data warehouse bus matrix (which lists how all key processes should be analyzed).  This matrix helps build the databases/data marts one by one.  The approach best serves a group of users who need a specific set of data now and do not want to wait the time it would take to create a full centralized data warehouse.  It provides the benefit of smaller, scoped projects, which are easier to price and can deliver value on a smaller/tighter budget.  It also has drawbacks: as we satisfy the needs and wants of today, small data marts (as opposed to one big data warehouse) are set up, and corralling all of these data marts into a future warehouse that provides a consistent and comprehensive view of the data can be an uphill battle.  These almost ad-hoc solutions may have fixed costs spread out over a few years, with variable costs added on top of the fixed costs (Connolly & Begg, 2015).

This seems like the approach a huge, cost-avoiding company would go for; big companies like GE, GM, or Ford, where the main product is not IT but their value stream.

The ETL:

Extracting, transforming, and loading (ETL) data from sources (via software) varies based on the data structures, schemas, processing rules, data integrity, mandatory fields, data models, etc.  ETL can be done quite easily in a CIF context, because all the data is present and can easily be transformed and loaded so that decision makers can make appropriate data-driven decisions.  With the BDL, not all the data is available at the beginning, until the full bus matrix is developed, and each data mart can hold a different design schema (star, snowflake, starflake), which adds complexity to how fast the data can be extracted and transformed, slowing down the ETL (MUSE, 2015).  In CIF, all the data sits in typical databases and thus in a single format.
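
As a rough, hedged sketch of what a single ETL step involves (the class, record, and method names below are my own illustration, not from the texts cited above): extract rows from a source, transform them by applying processing rules and enforcing mandatory fields, and load the result into the target store.

    import java.util.List;
    import java.util.stream.Collectors;

    // A minimal ETL sketch (illustrative only): extract rows from a source,
    // transform them to the target schema, and load them into the warehouse.
    class SimpleEtl {
        record SourceRow(String customerName, String amount) {}    // e.g., a CSV line
        record WarehouseRow(String customerName, double amount) {} // target schema

        // Extract: in practice this would read from a database, file, or API.
        static List<SourceRow> extract() {
            return List.of(new SourceRow("Acme", "100.50"), new SourceRow("Globex", "75.00"));
        }

        // Transform: apply processing rules, enforce mandatory fields, convert types.
        static List<WarehouseRow> transform(List<SourceRow> rows) {
            return rows.stream()
                       .filter(r -> r.customerName() != null && !r.customerName().isBlank())
                       .map(r -> new WarehouseRow(r.customerName(), Double.parseDouble(r.amount())))
                       .collect(Collectors.toList());
        }

        // Load: in practice this would insert into the warehouse or a data mart.
        static void load(List<WarehouseRow> rows) {
            rows.forEach(r -> System.out.println("Loading " + r));
        }

        public static void main(String[] args) {
            load(transform(extract()));
        }
    }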


Parallel Programming: Msynch 

Pseudocode:

class Msynch {
    int replies;
    int currentState = 1;

    synchronized void acquire() {
        // Called by thread wanting access to a critical section
        while (currentState != 1) { wait(); }
        replies = 0;
        currentState = 2;
        //
        // (Here, 5 messages are sent)
        //
        while (replies < 5) { wait(); }   // Await 5 replies
        currentState = 3;
    }

    synchronized void replyReceived() {
        // Called by communication thread when reply is received
        replies++;
        notifyAll();
    }

    synchronized void release() {
        // Called by a thread releasing the critical section
        currentState = 1;
        notifyAll();
    }
}

 

class Msynch1 {
    int replies;
    int currentState = 1;

    synchronized void acquire() {
        // Called by thread wanting access to a critical section
        while (currentState != 1) { yield(); }
        replies = 0;
        currentState = 2;
        //
        // (Here, 5 messages are sent)
        //
        if (replies < 5) { wait(); }   // Await 5 replies
        currentState = 3;
    }

    synchronized void replyReceived() {
        // Called by communication thread when reply is received
        replies++;
    }

    synchronized void release() {
        // Called by a thread releasing the critical section
        currentState = 1;
        notifyAll();
    }
}

From the two sets of code above, a line-by-line comparison identifies three differences (three synchronization-related errors in Msynch1).  The reasons why each is an error are:

  1. notifyAll(); is missing from replyReceived(). Without this line the code will not unblock any other thread, so a thread waiting in acquire() is never woken to recheck its condition. The missing call should activate all threads in the wait set so that they can compete with each other for the lock based on their priorities.
  2. {yield();} will not work because it does not release the object's lock or block until the condition changes, the way {wait();} does. The wait() call, which releases the thread's lock, is needed: when a thread calls wait() it unlocks the object, and after returning from the wait() call it re-locks the object.
  3. if will not work because the wait() call should sit in a wait loop: while (condition) { wait(); }, as shown in Msynch. Without the loop, the thread cannot retest the condition after it returns from the wait() call; with an if-statement the condition is tested only once, unlike with a while-statement.

An additional, fourth error was identified after reviewing the class notes in Java threads shortened_1.

  1. Best practice is not followed in either Msynch or Msynch1: the wait loop should actually reside in a try block, as in while (condition) { try { wait(); } catch (InterruptedException e) { ... } }, as sketched below.
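
As a rough sketch of that best practice applied to the Msynch code above (my own adaptation, not code given in the notes), acquire() with try-wrapped wait loops could look like this:

    synchronized void acquire() {
        // Called by thread wanting access to a critical section
        while (currentState != 1) {
            try { wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
        }
        replies = 0;
        currentState = 2;
        //
        // (Here, 5 messages are sent)
        //
        while (replies < 5) {   // Await 5 replies
            try { wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
        }
        currentState = 3;
    }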

When the first thread (appThread) in Msynch calls acquire() for the first time, currentState = 1, so it falls straight through the wait loop.  It then initializes replies to zero and sets currentState = 2.  The thread sends out 5 messages to other threads (which will call replyReceived()).  As long as replies is less than five, it stays waiting and currentState remains equal to two.  Once it has received 5 replies from any five threads, the wait loop exits and it sets currentState to three.

As the initial thread (appThread) running in acquire() waits on at least 5 messages from the other threads (commThreads), those threads respond by calling replyReceived().  Each call takes the lock, increments the number of replies by one, and calls notifyAll() before releasing the lock, so that the waiting appThread (and any other thread calling replyReceived()) can proceed.  Thus, once any five threads have successfully run replyReceived(), the appThread's wait loop ends and it can set currentState = 3.
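
To make the walkthrough concrete, here is a small, hedged driver sketch of my own (it assumes Msynch has been made compilable, for example with the try-wrapped wait loops shown earlier): one appThread acquires the lock while five commThreads each report a reply.

    class MsynchDemo {
        public static void main(String[] args) throws InterruptedException {
            Msynch m = new Msynch();

            // The application thread wants the critical section and waits for 5 replies.
            Thread appThread = new Thread(() -> {
                m.acquire();   // blocks until 5 replies have arrived
                System.out.println("appThread entered the critical section");
                m.release();
            });
            appThread.start();

            Thread.sleep(100);   // crude: let appThread enter acquire() first

            // Five communication threads each report one reply.
            for (int i = 0; i < 5; i++) {
                new Thread(m::replyReceived).start();
            }
        }
    }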

Words to define

Semaphores: In a program, code that sits between a lock and an unlock instruction is known as a critical section, and that critical section can only be accessed by one thread at a time.  If a thread sees a semaphore that is open, the thread closes it in one uninterruptible, atomic operation, and if that was successful the thread can confidently proceed into the critical section.  When the thread completes its task in the critical section, it reopens the semaphore.  Changes of state in the semaphore are important, because if a thread sees that the semaphore is closed, that thread stalls.  Thus, the semaphore ensures that only one thread at a time can work on the critical section of the code (Oracle, 2015).
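
As a hedged illustration using Java's standard java.util.concurrent.Semaphore (the example itself is mine, not from Oracle, 2015): a binary semaphore guards the critical section so that only one thread works in it at a time.

    import java.util.concurrent.Semaphore;

    class SemaphoreExample {
        // A binary semaphore: one permit means open, zero means closed.
        static final Semaphore semaphore = new Semaphore(1);

        static void doCriticalWork(String threadName) throws InterruptedException {
            semaphore.acquire();   // close the semaphore; other threads now stall
            try {
                System.out.println(threadName + " is in the critical section");
            } finally {
                semaphore.release();   // reopen the semaphore
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < 3; i++) {
                String name = "thread-" + i;
                new Thread(() -> {
                    try {
                        doCriticalWork(name);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }).start();
            }
        }
    }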

Test-and-set (TSET/TAND/TSL): A special hardware instruction on which semaphores are built, providing both read access and a conditional write in a single step by testing a bit (Sandén, 2011).  This eventually allows a thread to enter and work on the critical section of the code.  Essentially, a semaphore is open if the bit is 1 and closed if the bit is 0 (Oracle, 2015).  If the bit is 1, the test-and-set instruction attempts to close the semaphore by setting the bit to 0.  The TSET is carried out atomically and cannot be interrupted.
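
A hedged software sketch of the idea, using Java's AtomicBoolean.compareAndSet as a stand-in for the hardware test-and-set instruction (the SpinLock class is my own illustration):

    import java.util.concurrent.atomic.AtomicBoolean;

    // A simple spinlock built on an atomic test-and-set style operation.
    class SpinLock {
        // true = semaphore open, false = closed (held by some thread).
        private final AtomicBoolean open = new AtomicBoolean(true);

        void lock() {
            // Atomically test that the lock is open and set it to closed.
            // If another thread got there first, keep retrying (spin).
            while (!open.compareAndSet(true, false)) {
                Thread.onSpinWait();   // hint to the runtime that we are busy-waiting
            }
        }

        void unlock() {
            open.set(true);   // reopen the semaphore
        }

        public static void main(String[] args) {
            SpinLock lock = new SpinLock();
            Runnable task = () -> {
                lock.lock();
                try { System.out.println(Thread.currentThread().getName() + " holds the lock"); }
                finally { lock.unlock(); }
            };
            new Thread(task).start();
            new Thread(task).start();
        }
    }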

Preemption: Multi-threading can be set up with or without each thread having an assigned priority.  If priorities are set among the threads, a thread can be suspended (forestalled) by a higher-priority thread at any time.  This becomes important when the ratio of threads to processor cores is high (like 10 threads on a uniprocessor).  Preemption becomes unnecessary when the number of threads is less than the number of cores provided for the code to run (Sandén, 2011).


Parallel Programming: Practical examples of a thread

Here is a simple problem: A boy and a girl toss a ball back and forth to each other. Assume that the boy is one thread (node) and the girl is another thread, and b is data.

Boy = m

Girl = f

Ball = b

  • m has b
    1. m throws b –> f catches b
  • f has b
    1. f throws b –> m catches b

Assuming we could drop the ball, and holding everything else constant.

  • m has b
    1. m throws b –> f catches b
    2. m throws b –> f drops b
      1. f picks up the dropped b
  • f has b
    1. f throws b –> m catches b
    2. f throws b –> m drops b
      1. m picks up the dropped b

 

Suppose you add a third player.

Boy = m

Girl = f

Ball = b

3rd player = x

  • m has b
    1. m throws b –> f catches b
    2. m throws b –> x catches b
  • f has b
    1. f throws b –> m catches b
    2. f throws b –> x catches b
  • x has b
    1. x throws b –> m catches b
    2. x throws b –> f catches b

Assuming we could drop the ball, and holding everything else constant.

  • m has b
    1. m throws b –> f catches b
    2. m throws b –> f drops b
      1. f picks up the dropped b
    3. m throws b –> x catches b
    4. m throws b –> x drops b
      1. x picks up the dropped b
  • f has b
    1. f throws b –> m catches b
    2. f throws b –> m drops b
      1. m picks up the dropped b
    3. f throws b –> x catches b
    4. f throws b –> x drops b
      1. x picks up the dropped b
  • x has b
    1. x throws b –> m catches b
    2. x throws b –> m drops b
      1. m picks up the dropped b
    3. x throws b –> f catches b
    4. x throws b –> f drops b
      1. f picks up the dropped b

Will that change the thread models? What if the throwing pattern is not static; that is, the boy can throw to the girl or to the third player, and so forth? 

In this example: yes, an additional thread gets added, because each player is a thread that can catch or drop a ball.  Each player is a thread on its own, transferring the data ‘b’ amongst them; throwing ‘b’ is locking the data before transferring it, and catching ‘b’ is unlocking the data.  After the ball is dropped (perhaps decided randomly), the player with the ball now has to pick it up, which is equivalent to analyzing the data when a certain condition is met, like an account balance being < 500 or else.  The model changes with the additional player because each person now has a choice to make about which person should receive the ball next, a choice that is not present in the first model with only two threads.  If there exists a static toss pattern like

  • f –> m –> x –> f

Then the model doesn’t change, because there is no choice now.
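
As a hedged Java sketch of the two-player model (names and structure are my own, not from the course material): each player is a thread, the ball b is shared data, and throwing/catching maps to handing the lock and data back and forth with wait()/notifyAll().

    class BallGame {
        private String holder = "m";   // who currently has the ball b

        synchronized void throwTo(String from, String to) throws InterruptedException {
            while (!holder.equals(from)) {   // wait until it is our turn to throw
                wait();
            }
            System.out.println(from + " throws b -> " + to + " catches b");
            holder = to;      // hand over the data
            notifyAll();      // wake the other player(s)
        }

        public static void main(String[] args) {
            BallGame game = new BallGame();

            Runnable boy = () -> {   // thread m: always throws to f
                try { for (int i = 0; i < 3; i++) game.throwTo("m", "f"); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            };
            Runnable girl = () -> {  // thread f: always throws to m
                try { for (int i = 0; i < 3; i++) game.throwTo("f", "m"); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            };

            new Thread(boy).start();
            new Thread(girl).start();
        }
    }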

Finance 102: Financial Categories

We can categorize financial savings into common groups based on different types of expenses:

  • Hard Savings: Short-term variable expenses, which fall under expense management (Hawley, 2018). They are also known as Cost Savings (Dawson, 2018). In other words, savings that have a direct impact now, like cost reduction or revenue enhancement (WarehouseBlueprint, 2017). Here we want proactive, disciplined spending to help fund investment opportunities down the line (Hawley, 2018). For instance, reducing your outside dining over the next month or two to save money.
  • Cost Avoidance: Consumption of resources, which falls under optimization (Hawley, 2018). Another way to look at cost avoidance is lowering potential future expenses by reducing the gap in financial losses (Dawson, 2018; Study.com, n.d.). You will need to manage and plan for your needs and wants (Hawley, 2018). This reminds me of using everything you have in the kitchen before you buy a new set of groceries: find new recipes and reduce your costs of buying more food while older food goes unused and expires. Avoiding the expiring food is cost avoidance.
  • Potential Savings: Consumption of resources, which also falls under optimization. You should be identifying new opportunities and offerings to meet your needs and wants (Hawley, 2018). These are usually longer-term savings, like finding ways to cut your budget for the long term, such as reducing your phone bill to just a cell-phone bill or removing an unused gym membership from your budget. Another example is shifting your energy use to off-peak hours.
  • Write-downs: Amortization and depreciation, which essentially handle debt that needs to follow financial compliance and adjust the balance sheet numbers (Hawley, 2018; Investing Answers, 2019). This is the reduction of the book value of an asset (Investing Answers, 2019). For example, the mortgage on your house and the tax write-offs associated with it, or the value of your car, which depreciates when you drive it off the lot and changes based on fundamentals like mileage, age, and condition.
  • Delayed Savings: Extrapolated values for investments and funded projects, which can depend on a variety of factors (Girosi et al., 2005). We should be asking ourselves here whether the investments are aligned to our objectives, risk profile, and goals (Hawley, 2018). As an illustration, buying solar panels incurs a cost today, but the eventual savings on the electric bill will be realized in the future.
  • Future Debt: Debt to be incurred when undertaking investments and funded projects (Hawley, 2018). It is debt that will be created, or is created but will not be due today (The Law Dictionary, n.d.). For instance, future rental properties one finances to create cash flow, where a portion of the money goes to paying the mortgage and the rest goes elsewhere.

None of this can come to fruition without periodic reviews of your budget, expenses, and other data points, so that you can adjust your plan of action.


Parallel Programming: Logical Clocks

In a distributed system, nodes can talk to (cooperate with) each other and coordinate their systems.  However, the different nodes can execute concurrently, there is no global clock that all nodes run on, and some of these nodes can fail independently (Sandén, 2011).  Since nodes talk to each other, we must study them as they interact with each other.  Hence the need for logical clocks (because we don't have global clocks), in which distances in time are lost (Sandén, 2011).  With logical clocks, all nodes agree on a partial order of events (where one event can be said to happen before another).  They only describe the order of events, not actual time.  If nodes are completely disjoint in a logical clock, then a node can fail independently.  This is one way to visualize the complex nature of nodes.

[Figure: an example of a logical clock]
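
To make the idea more concrete, here is a hedged sketch of a Lamport-style logical clock in Java (the class and method names are my own illustration, not from Sandén): each local event and each send bumps the counter, and each receive jumps ahead of the sender's timestamp.

    // A minimal Lamport logical clock sketch (illustrative only).
    class LamportClock {
        private long time = 0;

        // A local event: advance the clock by one tick.
        synchronized long tick() {
            return ++time;
        }

        // Attach a timestamp to an outgoing message.
        synchronized long send() {
            return ++time;
        }

        // On receiving a message, jump past the sender's timestamp if it is ahead of us.
        synchronized long receive(long senderTime) {
            time = Math.max(time, senderTime) + 1;
            return time;
        }
    }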


Parallel Programming: Locks

A lock for a node could be for the latest service request. Nodes in a group have to agree on which one of them holds the lock at any one moment in time, which can be seen on a vector graph if we note which one holds the lock. A node can release and request a lock.

Mutual exclusion algorithms can have a centralized coordinator node that handles the requests for the lock, which means that if that node fails, so does the program (Sandén, 2011). Mutual exclusion algorithms can also allow contention-based exclusion, where nodes compete for the lock equally and a queue is created for pending requests.  Finally, controlled exclusion has a logical piece of code visit each node at a regulated interval of time to lend it the lock.

Lamport's clock can help order the contention-based scenario, where every node is trying to get the lock and it can only be obtained through a queue (Sandén, 2011). The queue tracks each request through a timestamp. A node earns the lock when it has all the reply messages it needs to run its task and it is at the top of its queue.

Sandén (2011) states that a multicast is sent to all nodes when the lock has been released, and eliminating this message can optimize the process. In that case, the node holding the lock should take the request from the next node in the queue and postpone replying to it until it is done with the lock.
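
As a very simplified, hedged sketch of the contention-based idea (my own illustration of just the queue ordering, not Sandén's full algorithm): pending requests are kept in a queue ordered by Lamport timestamp, and the request at the head of the queue is the next one allowed to take the lock, once all its replies have arrived.

    import java.util.PriorityQueue;

    class LockQueueSketch {
        // A lock request carries the requesting node's id and its Lamport timestamp.
        record Request(int nodeId, long timestamp) {}

        public static void main(String[] args) {
            // Requests are ordered by timestamp (ties broken by node id).
            PriorityQueue<Request> pending = new PriorityQueue<>(
                    (a, b) -> a.timestamp() != b.timestamp()
                            ? Long.compare(a.timestamp(), b.timestamp())
                            : Integer.compare(a.nodeId(), b.nodeId()));

            pending.add(new Request(2, 7));
            pending.add(new Request(1, 5));
            pending.add(new Request(3, 7));

            // The head of the queue is the next node allowed to take the lock
            // (in the full algorithm, only after it has received all replies).
            System.out.println("Next lock holder: node " + pending.peek().nodeId());
        }
    }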



To Do List: Home Network for Working Remotely

During 2020, with the rise of coronavirus disease 2019 (Covid-19), we saw a rise in working remotely. However, we soon realized that not everyone has the same connectivity speeds in their homes. We have also realized that there are internet connectivity deserts in the United States, where students from K-12 through university may not have reliable access to the internet. Though this post is not going to address internet access and connectivity deserts in the U.S., it will offer tips and techniques to help improve connectivity speeds in your home network. Even where internet connectivity can be taken for granted, connection speeds can vary from home to home, which can impact performance when working remotely. A quick test: if you can stream Disney+, Netflix, Hulu, YouTube, or any other video streaming platform on your devices over your wireless network, you should be able to work from home using WebEx, Zoom, etc. Essentially, 20 megabits per second or greater will suffice; however, 20-30 megabits per second is pushing it, especially if you are not single or have many other devices.

Your wireless connection is based on bandwidth, and depending on your plan with your service provider, that bandwidth is usually a fixed amount, i.e., 20 megabits per second, 50 megabits per second, 1 gigabit per second, etc. Therefore, it is imperative to use every bit effectively. How many devices do you have connected to your network? I have at least 8 devices on my network, with usually 5 connected at any one moment in time. Even though your cell phone may not be streaming anything, it is still interacting with your network, consuming a small bit of your network's bandwidth.

However, even if I have 5 items connected to the network, I do not have them all streaming something at the same time. If we crudely extrapolate from my anecdotal case, a family of four could have about 20 devices connected to one network. What do we do? Consider budgeting and prioritizing streaming times, as a more cost-effective solution. Depending on where you live, you may also be able to contact your service provider to increase your bandwidth. Also, check whether your house or neighborhood has been wired up for fiber or just DSL (a fiber-optic connection is best as of the writing of this post) and switch if you are not on fiber. A fiber connection allows for higher connectivity speeds.

Age plays a role in internet connectivity speeds, even if you have paid for higher speeds from your provider. The older the house, the older the wiring that connects to the internet provider, and with time, internet cable connections can degrade, which can also impact performance. The age of the router matters too: an older router may not be compatible with the higher speeds from your service provider. If you are renting one, contact your service provider to upgrade the router for free. If you have purchased one (recommended and more cost-effective), it may be time to upgrade it if it is old.

Even with newer routers, does yours have the latest security patch? Computer viruses can affect your router and degrade your connectivity speeds. It is always wise, regardless of the device, to accept new upgrades and security patches. There is a caveat here: given the haste with which some patches come out, I personally wait some time, depending on the need for a patch (the higher the current vulnerability, the sooner I accept the patch), before installing one. Security patches installed in more haste than warranted can sometimes cause more issues.

When you set up wireless in your house, you may have dead spots or streaming bottlenecks. It is often best to test your connection speed using an app or a connection speed-test website like www.speedtest.net. Start testing in the same room as, and nearest to, the router. There are two goals to testing near the router: (1) to see whether your speed is within 5-10 megabits per second of what you are paying your internet provider for, and (2) to set a baseline for what you should be getting around the house. If you are not getting the right speed according to your contract, check the age of and security patches on your router and retest. If that does not help, contact your service provider to address the situation. Once within the 5-10 megabit per second variance from your contract, you should go around and test each room. You will see that the farther you are from the router, the more the connection speed may drop. You may also see that the connection in different rooms varies significantly. Different devices may also have variable connection speeds based on their age; a 6-year-old laptop may have slower connection speeds than the latest mobile device.

The best solution when working from home is to install your router and wireless access point in the same room as your workstation and, if at all possible, to use a wired connection from your laptop directly to your router, bypassing wireless altogether for your work device. If the speed you measure while connected to the router by wire is significantly faster than while connected wirelessly, you know there may be an issue with the wireless. You can either buy a new wireless router, if yours is old, or troubleshoot from that point forward.

In summary, once you have addressed the above when working from home, consider the following to preserve your bandwidth: (1) limit unnecessary video chatting on online meeting platforms where you can; (2) limit streaming video on your work device; (3) limit streaming music on your work device; and (4) limit the number of devices connected to and streaming data on your at-home network. This is because live video takes more data to stream than prerecorded video (from YouTube, Netflix, etc.), which takes more than live voice over the network (Voice over Internet Protocol), which in turn takes more than streaming music (Pandora, Google Music, etc.).

Parallel Programming: Deadlocks

Deadlock can occur when a thread acquires an additional resource while holding one or more other resources, especially when this creates a circularity (Sandén, 2011).
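
A hedged, minimal Java sketch of such a circularity (my own example, not from Sandén): two threads each hold one resource and then wait for the other's, so neither can ever proceed.

    class DeadlockSketch {
        static final Object resourceA = new Object();
        static final Object resourceB = new Object();

        public static void main(String[] args) {
            // This program will typically hang: a circular wait forms between the two threads.
            new Thread(() -> {
                synchronized (resourceA) {        // hold A...
                    sleep(100);
                    synchronized (resourceB) {    // ...then wait for B
                        System.out.println("thread 1 has A and B");
                    }
                }
            }).start();

            new Thread(() -> {
                synchronized (resourceB) {        // hold B...
                    sleep(100);
                    synchronized (resourceA) {    // ...then wait for A: circular wait
                        System.out.println("thread 2 has B and A");
                    }
                }
            }).start();
        }

        static void sleep(long ms) {
            try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }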

Sandén (2011) states that to prevent deadlocks, resources need to be controlled.  One should draw a wait chain diagram to make sure the design can help prevent a deadlock, especially when there is a mix of transactions occurring.  It is also best to know how many threads/entities need to be called on simultaneously before a deadlock can occur; this is especially true when you have multiple threads calling on shared resources.

Thus, we should manage the resources to ensure no circularity, limit the number of entities to just below the threshold that would cause a deadlock, and eliminate the wait, as sketched below.
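
One common way to remove the circularity, shown here as a hedged sketch of my own: make every thread acquire the shared resources in the same fixed order, so a circular wait chain cannot form.

    class LockOrderingSketch {
        static final Object resourceA = new Object();
        static final Object resourceB = new Object();

        // Both threads acquire A before B, so no circular wait can occur.
        static void useBothResources(String name) {
            synchronized (resourceA) {
                synchronized (resourceB) {
                    System.out.println(name + " safely used A then B");
                }
            }
        }

        public static void main(String[] args) {
            new Thread(() -> useBothResources("thread 1")).start();
            new Thread(() -> useBothResources("thread 2")).start();
        }
    }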

There are many deadlocks in real life, like the one shown in Sandén (2011) with each of 4 cars halfway into an intersection. The following is a suggested real-life deadlock scenario:

There is one measuring cup (1/2 a cup), and there is no other way to measure this amount.  Jack and Jill are baking a cake at the same time.  They have all the items they need: eggs, cake mix, oil, and milk.  However, they both need the only measuring cup to measure the oil and milk, and they reach for it at the same time.  This is a deadlock.

To un-deadlock this scenario, Jack can pour the eggs and cake mix while Jill measures and pours the oil and milk.  When Jill is done, Jack measures and pours the oil and milk while Jill pours her cake mix and eggs.  The same could be done with up to four people, where each person is a thread and the measuring cup is the resource.

Once we introduce a fifth or more person, the wait chain has unnecessarily long periods of waiting before one thread can begin to use the resource.
