Adv DBs: Data warehouses

Data warehouses allow for people with decision power to locate the adequate data quickly from one location that spans across multiple functional departments and is very well integrated to produce reports and in-depth analysis to make effective decisions (MUSE, 2015). Corporate Information Factory (CIF) and Business Dimensional Lifecycle (BDL) tend to reach the same goal but are applied to different situations with it pros and cons associated with them (Connolly & Begg, 2015).

Corporate Information Factory:

Building consistent and comprehensive business data in a data warehouse to provide data to help meet the business and decision maker’s needs.   This view uses typically traditional databases, to create a data model of all of the data in the entire company before it is implemented in a data warehouse.  From the data warehouse, departments can create (data marts-subset of the data warehouse database data) to meet the needs of the department.  This is favored when we need data for decision making today rather than a few weeks out to a year once the system is set up.  You can see all the data you wish and be able to work with it in this environment.  However, a disadvantage from CIF is that latter point, you can see and work with data in this environment, with no need to wait weeks, months or years for the data you need, and that requires a large complex data warehouse.  This large complex data warehouse that houses all this data you would ever need and more would be expensive and time-consuming to set up.  Your infrastructure costs are high in the beginning, with only variable costs in years to follow (maintenance, growing data structures, adding new data streams, etc.) (Connolly & Begg, 2015).

This seems like an approach to a newer company, like twitter, would have.  Knowing that in the future they could do really powerful business intelligence analysis on their data, they may have made an upfront investment in their architecture and development team resources to build a more robust system.

Business Dimensional Lifecycles:

In this view, all data needs are evaluated first and thus creates the data warehouse bus matrix (listing how all key processes should be analyzed).   This matrix helps build the databases/data marts one by one.  This approach is best to serve a group of users with a need for a specific set of data that need it now and don’t want to wait for the time it would take to create a full centralized data warehouse.  This provides the perk of scaled projects, which is easier to price and can provide value on a smaller/tighter budget.  This has its drawbacks, as we satisfy the needs and wants for today, small data marts (as oppose to the big data warehouse) would be set up, and corralling all these data marts into a future warehouse to provide a consistent and comprehensive view of the data can be an uphill battle. This, almost ad-hoc solutions may have fixed cost spread out over a few years and variable costs are added to the fixed cost (Connolly & Begg, 2015).

This seems like an approach a cost avoiding company that is huge would go for.  Big companies like GE, GM, or Ford, where their main product is not IT it’s their value stream.

The ETL:

To extract, transform and load (ETL) data from sources (via a software) will vary based on the data structures, schema, processing rules, data integrity, mandatory fields, data models, etc.  ETL can be done quite easily in a CIF context, because all the data is present, and can be easily used and transformed to be loaded to a decision-maker, to make appropriate data-driven decisions.  With the BDL, not all the data will be available at the beginning until all of the matrices is developed, but then each data mart holds different design schemas (star, snowflake, star-flake) which can add more complexity on how fast the data can be extracted and transformed, slowing down the ETL (MUSE, 2015).  In CIF all the data is in typical databases and thus in a single format.

References:

Parallel Programming: Msynch 

pseudo code

class Msynch  {
     int replies;
     int currentState = 1;
        synchronized void acquire ( ) {
 // Called by thread wanting access to a critical section
              while (currentState != 1) {wait ( );}
              replies = 0; currentState = 2;
              //         
              // (Here, 5 messages are sent)         
              //         
        while (replies < 5) {wait ( );} // Await 5 replies         
              currentState = 3;    
        }   
        synchronized void replyReceived ( ) {
        // Called by communication thread when reply is received         
              replies++;         
              notifyAll ( );
        }    
        synchronized void release ( ) {
        // Called by a thread releasing the critical section    
              currentState = 1;        
              notifyAll ( );   
        }
}

 

class Msynch1  {    
        int replies;    
        int currentState = 1;      
        synchronized void acquire ( ) {
       // Called by thread wanting access to a critical section         
              while (currentState != 1) {yield ( );}         
              replies = 0; currentState = 2;         
              //         
              // (Here, 5 messages are sent)         
              //         
              if (replies < 5) {wait ( );} // Await 5 replies         
              currentState = 3;    
        }     
        synchronized void replyReceived ( ) {
        // Called by communication thread when reply is received
              replies++;
        }       
        synchronized void release ( ) {
        // Called by a thread releasing the critical section    
              currentState = 1;        
              notifyAll ( );   
        } 
}

From the two sets of code above, I have highlighted three differences (three synchronization related errors in Msynch1) in three different colors, as identified by a line-by-line code comparison.  The reason why

  1. notifyAll ( ); is missing, therefore this code won’t work because without this line of code we will not be able to unblock any other thread. Thus, this error will not allow for replyReceived ( ) to be synchronized. This missing part of the code should activate all threads in the wait set to allow them to compete with each other for the lock based on priorities.
  2. {yield ( );} won’t work is because it won’t block until queue not full like in {wait ( );}. Thus, the wait( ) function, which aids in releasing the threads lock is needed. When a thread calls wait( ) it will unlock the object.  After returning from wait( ) call, it will re-lock the object.
  3. if won’t work is because the wait( ) call should be in a “wait loop”: while (condition {wait( ); } as shown in Msynch. Without it, the thread cannot retest the condition after it returns from the wait ( ) call. With the if-statement, the condition is only tested once, unlike with the while-statement.

An additional fourth error was identified after reviewing the notes from class in Java threads shortened_1.

  1. Best practices are not followed in either Msynch and Msynch1 where the wait loop must actually reside in a try block, as follows: while condition) try {wait( );)}.

When the first thread (appThread) in Msynch calls acquire( ) the first time, it currentState = 1, so it enters into the wait loop.  Thus, its replies are initialized at zero and currentState =1.  This thread sends out 5 messages to other threads (calling on replyReceived( )).  As long as replies are less than five it stays locked and the currentState remains equal to two.  Once it has received 5 replies from any five threads, the code will unlock and it increments the currentState by one, so now it is equaled to three.

As the initial thread  (appThread) running in aquire( ) calls out other threads (commThreads) for at least 5 messages, these other threads do so by calling replyRecieved( ).  This code increments the number of replies by one each time a thread calls it and unlocks replies so it can increment it by one and then locks it so that another thread calling replyReceived( ) can increment it by one.  Thus, once five threads, any five threads can successfully run replyReceived( ), then we can increment currentState =3 as the lock on currentState is removed.

Words to define

Semaphores: In a coded programmed, where there exists a code between a lock and unlock instruction, it becomes known as a critical section.  That critical section can only be accessed by one thread at a time.  If a thread sees a semaphore that is open, the thread will close it in one uninterruptible and automatic operation, and if that was successful the thread can confidently proceed into the critical section.  When the thread completes its task in the critical section, it will reopen the semaphore. This, changes of state in the semaphore are important, because if a thread sees that the semaphore is closed that thread stalls.  Thus, this ensures that only one thread at a time can work on the critical section of the code (Oracle, 2015).

Test-and-set (TSET/TAND/TSL): It is a special set of hardware instructions, which semaphores operate in order to have both read access and conditional write access by testing a bit of the code (Sanden, 2011).  This will eventually allow a thread to eventually work on the critical section of the code.   Essentially, a semaphore is open if the bit is 1 and closed if the bit is 0 (Oracle, 2015).  If the bit is 1, the test set of instructions will attempt to close the semaphore by changing and setting the bit to 0.  This TSET is conducted automatically and cannot be interrupted.

Preemption: Multi-threading can be set with or without each thread having an assigned priority preemptively.  If a priority is set amongst the threads, a thread can be suspended (forestalled) by a higher-priority thread at any one time.  This becomes important when the ratio of thread to computer cores is high (like 10 threads on a uniprocessor).  Preemption becomes obsolete when the number of threads is less than the number of cores provided for the code to run (Sanden, 2011).

References