Adv DBs: Data Abstractions

Data Abstraction

Text can be abstracted into information and knowledge through either hard clustering, where a word has only one connection, or soft clustering, where a word can have multiple connections to other words (Kulkarni & Kinariwala, 2013).  Clustering, in general, is grouping things together based on similar characteristics.  Hard clustering is difficult to apply to the sentences of a paragraph, or to prose in general, because each sentence is interconnected with the sentences above and below it, and clusters within prose can overlap with one another.  Thus, the authors propose that soft clustering should be used for the analysis of sentences within prose.  The method proposed by Kulkarni & Kinariwala is PageRank, used to show the importance of sentences within a document and thereby help summarize it. The weakness of this paper lies in the fact that the authors propose an idea without testing it.  They did not develop any code or analyze any data set to say whether their hypothesis on PageRank was correct.  Thus, it remains a wonderful thought experiment.  Its strength, if proven correct by other studies, is that it maps out the limitations and strengths of hard and soft clustering for data mining within prose and between prose of a similar nature.
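Since the paper stops at the proposal stage, below is a minimal sketch of how sentence-level PageRank could be implemented, assuming pairwise sentence similarity scores (the soft connections) have already been computed. The class name, similarity values, and parameters are illustrative, not from the paper.

import java.util.Arrays;

// Minimal sketch of sentence-level PageRank (soft clustering view):
// each sentence is a node, and edge weights are pairwise similarity scores.
public class SentenceRank {

    // similarity[i][j] = how strongly sentence i relates to sentence j (0..1)
    static double[] rank(double[][] similarity, double damping, int iterations) {
        int n = similarity.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                 // start with uniform importance

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    if (j == i) continue;
                    double out = rowSum(similarity, j) - similarity[j][j];
                    if (out > 0) sum += similarity[j][i] / out * rank[j];
                }
                next[i] = (1 - damping) / n + damping * sum;
            }
            rank = next;
        }
        return rank;                                // higher rank = more central sentence
    }

    private static double rowSum(double[][] m, int row) {
        double s = 0.0;
        for (double v : m[row]) s += v;
        return s;
    }

    public static void main(String[] args) {
        // Toy similarity matrix for three sentences (values are made up).
        double[][] sim = {
            {1.0, 0.6, 0.1},
            {0.6, 1.0, 0.4},
            {0.1, 0.4, 1.0}
        };
        System.out.println(Arrays.toString(rank(sim, 0.85, 50)));
    }
}

The highest-ranked sentences would then be the candidates for the document summary the authors have in mind.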

Management issues in systems development

Information is seen as of great value to humanitarian efforts in accomplishing their missions.  Van de Walle & Comes (2015) state that the United Nations has delivered methods for humanitarian Information Management revolving around checking, sharing, and using data.  Checking data revolves around reliability and verifiability; sharing data revolves around interoperability (data formats), accessibility, and sustainability; and using data deals with timeliness and relevance.  After interviewing humanitarians from two different disaster scenarios, Syria and Typhoon Haiyan, for about 1-1.5 hours each, the authors concluded that standard processes can be followed for natural disasters like a landfalling hurricane, although standard processes lend themselves to inflexibility and do not meet all the intricate needs. In a more complicated relief effort like Syria, confidentiality and unreliable data sources (sometimes arriving in the manner of an old spy movie, passed under the table) affected the entire process.  Finally, this small sample size of two events and a handful of humanitarians interviewed suggests that further research is definitely needed before generalizations can be made about developing Information Management systems for natural versus geopolitical disasters. The main strength of this paper is its analysis of information management for disasters with respect to the standards imposed by the UN.  It also illustrates that information management is end-to-end.  My research hopes to help improve pre-disaster conditions, whereas their research covers aid post-disaster.  For the same disaster, a landfalling hurricane, the key information needed to carry out the respective tasks changes.  In other words, hurricane wind speeds are no longer needed after the storm has passed over a city and left a wake of destruction, and the death toll is not important before the hurricane makes landfall.  But we need wind speeds to improve forecasts and mitigate death tolls, and we need the current death toll to make sure we can keep it from rising after the disaster has struck.

References

  • Van de Walle, B. & Comes, T. (2015) On the Nature of Information Management in Complex and Natural Disasters. Procedia Engineering, Pages 403-411.
  • Kulkarni, B. M., & Kinariwala, S. A. (2013). Review on Fuzzy Approach to Sentence Level Text Clustering. International Journal of Scientific Research and Education. Pages 3845-3850.

Adv DBs: A possible future project?

Below is a possible future research paper on a database related subject.

Title: Using MapReduce to aid in analyzing clinical test utilization patterns in medicine

The motivation:

Efficient processing and analysis of clinical data could lead to better clinical testing of patients, and MapReduce solutions allow for an integrated solution in the medical field, which helps save resources when moving data in and out of storage.

The problem statement (symptom and root cause)

The rates of Sexually Transmitted Infections (STIs) are increasing at alarming rates. Could loading the test utilization patterns of the Roper Saint Francis Clinical Network in the South into Hadoop with MapReduce reveal patterns in the current STI population and predict areas where an outbreak may be imminent?

The hypothesis statement (propose a solution and address the root cause)

H0: Data mining in Hadoop with MapReduce will not be able to identify any meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

H1: Data mining in Hadoop with MapReduce can identify a meaningful pattern that could be used to predict the next location for an STI outbreak using clinical test utilization patterns.

The research questions

Could the results of this study on STI outbreak rates be generalized to the outbreak rates of other diseases?

Is this application of data-mining in Hadoop with MapReduce the correct way to analyze the data?

The professional significance statement (new contribution to the body of knowledge)

Identifying where an outbreak of any disease (including STIs) will occur via clinical test utilization patterns has yet to be done, according to Mohammed et al. (2014), who also state that Hadoop with MapReduce is a great tool for clinical work because it has already been adopted in similar fields of medicine such as bioinformatics.
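As a rough sketch of what such a MapReduce job could look like, the code below counts STI-related test orders per ZIP code using the standard Hadoop MapReduce API. The input record layout (comma-separated patient ID, ZIP code, test code, date) and the test-code convention are assumptions made for illustration, not the actual Roper Saint Francis format.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts STI test orders per ZIP code; assumes records like "patientId,zip,testCode,date".
public class StiTestsByZip {

    public static class TestMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text zip = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length < 3) return;              // skip malformed rows
            if (fields[2].startsWith("STI")) {          // assumed test-code convention
                zip.set(fields[1]);
                context.write(zip, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));   // (zip, total STI tests)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "STI tests by ZIP");
        job.setJarByClass(StiTestsByZip.class);
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The per-ZIP counts over time would be the raw material for the outbreak-prediction step in the hypotheses above.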

Resources

  • Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: Current landscape and future trends. BioData Mining, 7. doi:http://dx.doi.org/10.1186/1756-0381-7-22
  • Pokorny, J. (2011). NoSQL databases: A step to database scalability in web environment. In iiWAS '11 Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (pp. 278-283).

Qualitative Methods: Questions distinctions

Qualitative Research Questions distinctions

Usually, qualitative research methods start off with one or two open-ended central questions using the words “what” or “how” about a single phenomenon or concept, in order to suggest an exploratory design.  The rest of the question uses exploratory verbs (report, describe, discover, seek, explore, etc.) in a non-directional manner so as not to suggest causation.  The research question could also be phrased in a way that suggests which qualitative methodological tool will be used to analyze the data; for example, using the word “opinions” could imply interviews. Finally, the research question could include the key defining features of the participants in the study (teens, women, men, veterans, people with disabilities, etc.) (Creswell, 2014).

Central research question:

An example of a central question could come from a follow-on study on the results of my doctoral research:

What are the opinions of the hurricane specialists at the National Hurricane Center on the results from the use of text analytics on tropical discussions to discover weather constructs that positively and negatively affect hurricane forecast skill?

References:

Qualitative Research: Sampling

Purposive and Theoretical Sampling:

When identifying means for recording data in qualitative research, one must also be mindful of how the data is collected; it can be via unstructured or semi-structured observations and interviews, documents, and visual materials (Creswell, 2014).  Purposeful sampling helps select the (1) actors, (2) events, (3) setting, and (4) process that will best allow the researcher to get a firm grasp on understanding and addressing the central questions and sub-questions of the study.  One must also consider how many sites and participants there should be in the study (identifying the sample size).  The sample size can vary from 1-2 in narrative research, 3-10 in grounded theory, 20-30 for ethnographic studies, and 4-5 cases in case studies (Creswell, 2014).

However, data saturation (when the researcher stops data collection because no new information would reveal any further insights or properties addressing the research question) can be reached before any of these aforementioned numbers (Creswell, 2014). Theoretical sampling is sampling bound around a theoretical concept, and it touches more directly on this idea of data saturation: the researcher keeps sampling in order to understand the data and define or refine their theory until the point of data saturation, rather than until a predefined number is reached.

Example:

An example of this could come from studying the effects of business decisions on the family by analyzing relocation decisions on non-military families (PROCESS).  In this example I would purposefully sample three groups of families (ACTORS): those with no children, those whose children are no older than 12 years of age, and those with one or more children over the age of 13.  I want to see if there is a difference in reactions based on having children, and on having older versus younger children, over the past decade (EVENT) at Boeing (SETTING).  I could aim for 20-30 families per group, for a total sample size of 60-90, or I could aim for data saturation within each of these groups (theoretical sampling).  If I want to stick with a total sample size of 60-90, I could use an open-answer survey or conduct interviews (which is more costly on my end).  If I wish to aim for data saturation, it can be more easily done with interviews.

References:

Adv DBs: Unsupervised and Supervised Learning

Unsupervised and Supervised Learning:

Supervised learning is a type of machine learning where, given a set of labeled data points, we choose a function that maps inputs to a classification or a value; eventually the machine receives data points that do not yet have a classification or value, and it must apply that function to predict them. There are two main types of supervised learning: classification (a finite set of outcomes, e.g., based on a person’s chromosomes in a database, their biological gender is labeled either male or female) and regression (outputs are real numbers in real space or n-dimensional real space).  In regression, you can have a 2-dimensional real space with training data that yields a regression formula and a Pearson correlation coefficient r; given a new data point, the machine can use that regression formula to predict where the point will fall in the 2-dimensional real space (Mathematicalmonk’s channel, 2011a).
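As a minimal sketch of the regression case described above, the code below fits a least-squares line to a handful of made-up training points and then predicts the value of a new, unlabeled point; the numbers and class name are purely illustrative.

// Simple supervised regression sketch: fit y = a + b*x by least squares on training data,
// then predict the value of an unseen point. Training values are made up for illustration.
public class SimpleRegression {

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};               // training inputs
        double[] y = {2.1, 3.9, 6.2, 8.1, 9.8};     // training labels (roughly y = 2x)

        double meanX = mean(x), meanY = mean(y);
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - meanX) * (y[i] - meanY);
            den += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = num / den;
        double intercept = meanY - slope * meanX;

        double newPoint = 6.0;                      // new, unlabeled data point
        System.out.printf("predicted y for x=%.1f: %.2f%n",
                newPoint, intercept + slope * newPoint);
    }

    private static double mean(double[] v) {
        double s = 0;
        for (double d : v) s += d;
        return s / v.length;
    }
}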

Unsupervised learning aims to uncover homogeneous subpopulations in databases (Connolly & Begg, 2015). In unsupervised learning the machine is given data points (values, documents, strings, etc.) in n-dimensional real space and looks for patterns through clustering, density estimation, dimensional reduction, etc.  For clustering, the machine takes the data points and places them in bins with common properties, which are sometimes unknown to the end-user due to the vast size of the data within the database.  With density estimation, the machine is fed a set of probability density functions to fit the data and estimates the density of the data set.  Finally, for dimensional reduction, the machine finds some lower-dimensional space in which the data can be represented (Mathematicalmonk’s channel, 2011b).  Dimensional reduction, however, can destroy structure that is only visible in the higher-order dimensions.
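To make the clustering case concrete, here is a minimal k-means sketch that groups unlabeled one-dimensional points into two bins; the points, the choice of k = 2, and the initial centers are arbitrary and only for illustration.

import java.util.Arrays;

// Minimal k-means sketch: group unlabeled 1-D points into k = 2 bins with common properties.
// The data is arbitrary; real use would be n-dimensional and data-driven.
public class KMeansSketch {

    public static void main(String[] args) {
        double[] points = {1.0, 1.2, 0.8, 5.1, 5.3, 4.9};
        double[] centers = {0.0, 1.0};              // crude initial guesses
        int[] assignment = new int[points.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: attach each point to its nearest center.
            for (int i = 0; i < points.length; i++) {
                assignment[i] = Math.abs(points[i] - centers[0])
                        <= Math.abs(points[i] - centers[1]) ? 0 : 1;
            }
            // Update step: move each center to the mean of its assigned points.
            for (int c = 0; c < centers.length; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) { sum += points[i]; count++; }
                }
                if (count > 0) centers[c] = sum / count;
            }
        }
        System.out.println("centers: " + Arrays.toString(centers));
        System.out.println("assignments: " + Arrays.toString(assignment));
    }
}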

Applications suited to each method

  • Supervised: defining data transformations (Kelvin to Celsius, meters per second to miles per hour, classifying a biological male or female given the number of chromosomes, etc.), predicting weather (given the initial & boundary conditions, plug them into formulas that predict what will happen in the next time step).
  • Unsupervised: forecasting stock markets (through patterns identified in text mining news articles, or sentiment analysis), reducing demographical database data to common features that can easily describe why a certain population will fit a result over another (dimensional reduction), cloud classification dynamical weather models (weather models that use stochastic approximations, Monte Carlo simulations, or probability densities to generate cloud properties per grid point), finally real-time automated conversation translators (either spoken or closed captions).

Most important issues related to each method

Unsupervised machine learning is at the bedrock of big data analysis.  We could use training data (a set of predefined data that is representative of the real data in all its n dimensions) to fine-tune most unsupervised machine learning efforts and reduce error rates (Barak & Modarres, 2015). What I like most about unsupervised machine learning is its clustering and dimensional reduction capabilities, because it can quickly show me what is important about my big data set without huge amounts of coding and testing on my end.

References:

Adv DBs: Data warehouses and OLAP

Data warehouses allow people with decision power to quickly locate adequate data in one location that spans multiple functional departments and is very well integrated, in order to produce reports and in-depth analysis for effective decisions (MUSE, 2015a). Data can be stored in n-dimensional data cubes that can be dissected, filtered, and rolled up by a dynamic application called Online Analytical Processing (OLAP). OLAP can be its own system or part of a data warehouse, and if it is combined with data mining tools it creates a decision support system (DSS) to uncover/discover hidden relationships within the data (MUSE, 2015b). A DSS needs both a place to store data and a way to analyze it.  The data warehouse doesn’t answer the “Why?” questions, but the “How?, What?, When?, Where?”, and that is where OLAP helps.  We want to extract as much knowledge as possible for decision making from these systems, hence why we need both in a DSS to answer all of the questions, not just a subset.  And, as mentioned above, data mining tools are also needed for a DSS.

Data Warehouses

Discovering knowledge through archives of data from multiple sources in a consolidated and integrated way is what the data warehouse does best.  Data warehouses are subject-oriented (organized around subjects such as customers, products, and sales rather than around operational records like invoices), integrated (data from different sources in the enterprise, possibly in different formats), time-variant (data varies with respect to time), and nonvolatile (new data is appended rather than replacing old data).  Suitable sources to feed in data can be mainframes, proprietary file systems, servers, internal workstations, external website data, etc., which can then be used for analysis and for discovering knowledge for effective data-based decision making.  Detailed data can be stored online if it helps support or supplement the summarized data, so data warehouses can be kept technically light by relying on summarized data.  Summarized data, which is updated automatically as new data enters the warehouse, mainly helps improve query speeds; the detailed data itself is kept offline (Connolly & Begg, 2015).  Looking into the architecture of this system:

The operational data store (ODS) holds the data that is staged for the data warehouse.  From staging, the load manager performs all extraction and loading of data into the warehouse, while the warehouse manager performs all transformations of the data within the warehouse.  The query manager handles all queries against the data warehouse. Metadata (definitions of the stored data and its units) is used by all of these processes and exists in the warehouse as well (Connolly & Begg, 2015).

 OLAP

What sets OLAP apart from other query systems is its use of analytics to answer the “Why?” questions on data placed in a dynamic, n-dimensional, aggregated view.  OLAP is more complex than statistical analysis on aggregated data; it is more of a slice-and-dice with time series and data modeling.  OLAP servers come in four main flavors: Multidimensional OLAP (MOLAP: uses multidimensional databases where data is stored based on usage), Relational OLAP (ROLAP: supports relational DBMS products through a metadata layer), Hybrid OLAP (HOLAP: certain data is in ROLAP and other data is in MOLAP), and Desktop OLAP (DOLAP: usually for small file extracts, with data stored in client files/systems like a laptop/desktop).

DSS, OLAP, and Data Warehouse Application

Consider a car insurance claims DSS.  Insurance companies can use such a system to analyze patterns in how people drive, what damage can or cannot occur in a given accident, and why someone might claim false damages to fix their car or cash out.  Their systems can then define the who, what, when, where, why, and how of each accident against all other accidents they have processed (slicing and dicing by state, type of accident, vehicle types involved, etc.) to help them resolve whether a claim is legitimate.
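To make the slice-and-dice idea concrete, here is a toy roll-up over a small three-dimensional cube of claim counts; the dimensions, labels, and numbers are invented for illustration and do not come from any insurer's data.

// Toy roll-up over a 3-D cube of insurance claim counts:
// dimensions are state x accident type x vehicle type (all values invented).
public class ClaimsCube {

    public static void main(String[] args) {
        String[] states = {"SC", "GA"};
        String[] accidentTypes = {"rear-end", "single-car"};
        String[] vehicleTypes = {"sedan", "truck"};

        // cube[state][accidentType][vehicleType] = number of claims
        int[][][] cube = {
            {{12, 5}, {7, 3}},
            {{9, 4},  {6, 2}}
        };

        // Roll up over vehicle type: claims per state and accident type.
        for (int s = 0; s < states.length; s++) {
            for (int a = 0; a < accidentTypes.length; a++) {
                int total = 0;
                for (int v = 0; v < vehicleTypes.length; v++) total += cube[s][a][v];
                System.out.println(states[s] + " / " + accidentTypes[a] + ": " + total);
            }
        }

        // Slice: fix state = "SC" and look at one cell of the remaining 2-D view.
        System.out.println("SC rear-end sedans: " + cube[0][0][0]);
    }
}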

References:

Adv DBs: Data warehouses

Data warehouses allow people with decision power to quickly locate adequate data in one location that spans multiple functional departments and is very well integrated, in order to produce reports and in-depth analysis for effective decisions (MUSE, 2015). The Corporate Information Factory (CIF) and the Business Dimensional Lifecycle (BDL) tend to reach the same goal but are applied to different situations, each with its own pros and cons (Connolly & Begg, 2015).

Corporate Information Factory:

The CIF approach builds consistent and comprehensive business data in a data warehouse to provide data that meets the business’s and decision makers’ needs.   This view typically uses traditional databases to create a data model of all of the data in the entire company before it is implemented in a data warehouse.  From the data warehouse, departments can create data marts (subsets of the data warehouse data) to meet their own needs.  This is favored when data for decision making is needed today rather than a few weeks to a year out once the system is set up: you can see all the data you wish and work with it in this environment.  The disadvantage of CIF follows from that same point: being able to see and work with all of that data, with no need to wait weeks, months, or years, requires a large, complex data warehouse, which is expensive and time-consuming to set up.  Infrastructure costs are high in the beginning, with only variable costs in the years that follow (maintenance, growing data structures, adding new data streams, etc.) (Connolly & Begg, 2015).

This seems like the approach a newer company, like Twitter, would take.  Knowing that in the future they could do really powerful business intelligence analysis on their data, they may have made an upfront investment in their architecture and development team resources to build a more robust system.

Business Dimensional Lifecycles:

In the BDL view, all data needs are evaluated first to create the data warehouse bus matrix (listing how all key processes should be analyzed).   This matrix helps build the databases/data marts one by one.  This approach is best for serving a group of users who need a specific set of data now and don’t want to wait for the time it would take to create a full centralized data warehouse.  It provides the perk of scaled projects, which are easier to price and can provide value on a smaller/tighter budget.  It has its drawbacks: as we satisfy the needs and wants of today, small data marts (as opposed to one big data warehouse) get set up, and corralling all these data marts into a future warehouse that provides a consistent and comprehensive view of the data can be an uphill battle. These almost ad-hoc solutions may have fixed costs spread out over a few years, with variable costs added on top of the fixed costs (Connolly & Begg, 2015).

This seems like the approach a huge, cost-avoiding company would go for: big companies like GE, GM, or Ford, whose main product is not IT but their value stream.

The ETL:

How data is extracted, transformed, and loaded (ETL) from sources (via software) will vary based on the data structures, schemas, processing rules, data integrity, mandatory fields, data models, etc.  ETL can be done quite easily in a CIF context, because all the data is present and can easily be used and transformed before being loaded for decision makers to make appropriate data-driven decisions.  With BDL, not all the data will be available at the beginning, until the full bus matrix is developed; in addition, each data mart may hold a different design schema (star, snowflake, starflake), which adds complexity to how quickly the data can be extracted and transformed, slowing down the ETL (MUSE, 2015).  In CIF, all the data is in typical databases and thus in a single format.
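Below is a minimal ETL sketch under assumed inputs: it extracts comma-separated sales rows, transforms the region, amount, and date into a common format, and loads them into an in-memory list that stands in for a warehouse table. The record layout ("region,amountUSD,yyyy/MM/dd") and all names are hypothetical.

import java.util.ArrayList;
import java.util.List;

// Minimal ETL sketch: extract raw comma-separated rows, transform them into a
// common format, and load them into a list standing in for a warehouse table.
public class MiniEtl {

    record Sale(String region, double amount, String isoDate) {}

    public static void main(String[] args) {
        // Extract: rows as they might arrive from a source system.
        List<String> raw = List.of("SOUTH,1200.50,2015/06/01", "WEST,830.00,2015/06/02");

        // Transform: normalize region case, parse the amount, reformat the date.
        List<Sale> warehouse = new ArrayList<>();
        for (String row : raw) {
            String[] f = row.split(",");
            warehouse.add(new Sale(f[0].toLowerCase(),
                    Double.parseDouble(f[1]),
                    f[2].replace('/', '-')));
        }

        // Load: here we just print; a real load would write into the warehouse schema.
        warehouse.forEach(System.out::println);
    }
}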

References:

Parallel Programming: Msynch 

pseudo code

class Msynch {
    int replies;
    int currentState = 1;

    synchronized void acquire() {
        // Called by thread wanting access to a critical section
        while (currentState != 1) { wait(); }
        replies = 0; currentState = 2;
        //
        // (Here, 5 messages are sent)
        //
        while (replies < 5) { wait(); } // Await 5 replies
        currentState = 3;
    }

    synchronized void replyReceived() {
        // Called by communication thread when reply is received
        replies++;
        notifyAll();
    }

    synchronized void release() {
        // Called by a thread releasing the critical section
        currentState = 1;
        notifyAll();
    }
}

 

class Msynch1 {
    int replies;
    int currentState = 1;

    synchronized void acquire() {
        // Called by thread wanting access to a critical section
        while (currentState != 1) { yield(); }
        replies = 0; currentState = 2;
        //
        // (Here, 5 messages are sent)
        //
        if (replies < 5) { wait(); } // Await 5 replies
        currentState = 3;
    }

    synchronized void replyReceived() {
        // Called by communication thread when reply is received
        replies++;
    }

    synchronized void release() {
        // Called by a thread releasing the critical section
        currentState = 1;
        notifyAll();
    }
}

A line-by-line comparison of the two sets of code above reveals three differences, which are three synchronization-related errors in Msynch1:

  1. notifyAll( ) is missing from replyReceived( ). Without this call, no waiting thread is ever woken up, so the thread blocked in acquire( ) can never re-check its condition. The missing call should activate all threads in the wait set so they can compete with each other for the lock based on their priorities.
  2. {yield( );} won’t work because, unlike {wait( );}, it does not release the lock and block until the condition can change. The wait( ) call is needed: when a thread calls wait( ) it unlocks the object, and after returning from the wait( ) call it re-locks the object.
  3. The if won’t work because the wait( ) call should sit in a “wait loop”: while (condition) { wait( ); }, as shown in Msynch. Without the loop, the thread cannot retest the condition after it returns from the wait( ) call; with an if-statement the condition is only tested once, unlike with a while-statement.

An additional fourth error was identified after reviewing the notes from class in Java threads shortened_1.

  1. Best practices are not followed in either Msynch or Msynch1: the wait loop should actually reside in a try block, as follows: while (condition) { try { wait( ); } catch (InterruptedException e) { } }. A corrected version of Msynch1 applying all four fixes is sketched below.
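Pulling the three fixes and the try-block best practice together, a corrected Msynch1 could look like the following sketch; the class name Msynch1Fixed is mine and interrupt handling is deliberately minimal.

class Msynch1Fixed {
    int replies;
    int currentState = 1;

    synchronized void acquire() {
        // Called by thread wanting access to a critical section
        while (currentState != 1) {                 // wait loop: retest the condition each time
            try { wait(); } catch (InterruptedException e) { }   // wait(), not yield(), in a try block
        }
        replies = 0; currentState = 2;
        //
        // (Here, 5 messages are sent)
        //
        while (replies < 5) {                       // while, not if: await all 5 replies
            try { wait(); } catch (InterruptedException e) { }
        }
        currentState = 3;
    }

    synchronized void replyReceived() {
        // Called by communication thread when a reply is received
        replies++;
        notifyAll();                                // restored: wake waiting threads to retest
    }

    synchronized void release() {
        // Called by a thread releasing the critical section
        currentState = 1;
        notifyAll();
    }
}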

When the first thread (appThread) in Msynch calls acquire( ) for the first time, currentState equals 1, so it skips the first wait loop, initializes replies to zero, and sets currentState to 2.  This thread then sends out 5 messages to other threads (which will call replyReceived( )).  As long as replies is less than five it waits in the second loop, and currentState remains equal to 2.  Once it has received 5 replies from any five threads, it exits the loop and sets currentState to 3.

As the initial thread (appThread) running in acquire( ) waits on at least 5 messages from the other threads (commThreads), those threads respond by calling replyReceived( ).  Each call acquires the object’s lock, increments replies by one, calls notifyAll( ), and releases the lock on return so that another thread calling replyReceived( ) can increment it in turn.  Thus, once any five threads have successfully run replyReceived( ), the waiting appThread re-checks its condition, finds replies equal to 5, and sets currentState to 3.

Words to define

Semaphores: In a program, the code between a lock and an unlock instruction is known as a critical section, and that critical section can only be accessed by one thread at a time.  If a thread sees that a semaphore is open, the thread closes it in one uninterruptible, atomic operation, and if that operation was successful the thread can confidently proceed into the critical section.  When the thread completes its task in the critical section, it reopens the semaphore. These changes of state in the semaphore are important, because if a thread sees that the semaphore is closed, that thread stalls.  This ensures that only one thread at a time can work on the critical section of the code (Oracle, 2015).
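As a small illustration, the sketch below guards a critical section with the standard java.util.concurrent.Semaphore, using a single permit as the open/closed state; the class and thread names are mine.

import java.util.concurrent.Semaphore;

// Sketch: a binary semaphore guarding a critical section.
// One permit means "open"; acquiring the permit closes it.
public class SemaphoreDemo {
    private static final Semaphore gate = new Semaphore(1);   // one permit: open

    static void criticalSection(String name) throws InterruptedException {
        gate.acquire();                       // closes the semaphore, or blocks if already closed
        try {
            System.out.println(name + " is inside the critical section");
        } finally {
            gate.release();                   // reopens the semaphore for the next thread
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            String name = "thread-" + i;
            new Thread(() -> {
                try { criticalSection(name); } catch (InterruptedException e) { }
            }).start();
        }
    }
}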

Test-and-set (TSET/TAND/TSL): This is a special hardware instruction on which semaphores are built; it provides both read access and conditional write access by testing a bit in a single step (Sanden, 2011), which is what eventually allows a thread to work on the critical section of the code.   Essentially, a semaphore is open if the bit is 1 and closed if the bit is 0 (Oracle, 2015).  If the bit is 1, the test-and-set instruction attempts to close the semaphore by setting the bit to 0.  This TSET is performed atomically and cannot be interrupted.
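In Java, AtomicBoolean.compareAndSet can stand in for the hardware test-and-set instruction: it reads and conditionally writes the flag in one atomic step. The sketch below uses a boolean (false = open, true = closed) in place of the 1/0 bit described above; the class name is mine.

import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the test-and-set idea: compareAndSet flips the "semaphore bit" in one
// atomic, uninterruptible step; only the thread that wins the flip may proceed.
public class TestAndSetLock {
    private final AtomicBoolean closed = new AtomicBoolean(false);  // false = open

    void lock() {
        // Spin until we atomically change open (false) to closed (true).
        while (!closed.compareAndSet(false, true)) {
            Thread.onSpinWait();              // hint to the runtime that we are spinning
        }
    }

    void unlock() {
        closed.set(false);                    // reopen so another thread can win the flip
    }
}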

Preemption: Multi-threading can be set up with or without each thread having an assigned priority for preemption.  If priorities are set among the threads, a thread can be suspended (forestalled) by a higher-priority thread at any time.  This becomes important when the ratio of threads to processor cores is high (like 10 threads on a uniprocessor).  Preemption becomes largely irrelevant when the number of threads is less than the number of cores available to run the code (Sanden, 2011).

References

Parallel Programming: Practical examples of a thread

Here is a simple problem: A boy and a girl toss a ball back and forth to each other. Assume that the boy is one thread (node) and the girl is another thread, and b is data.

Boy = m

Girl = f

Ball = b

  • m has b
    1. m throws b –> f catches b
  • f has b
    1. f throws b –> m catches b

Assuming we could drop the ball, and holding everything else constant.

  • m has b
    1. m throws b –> f catches b
    2. m throws b –> f drops b
      1. f picks up the dropped b
  • f has b
    1. f throws b –> m catches b
    2. f throws b –> m drops b
      1. m picks up the dropped b

 

Suppose you add a third player.

Boy = m

Girl = f

Ball = b

3rd player = x

  • m has b
    1. m throws b –> f catches b
    2. m throws b –> x catches b
  • f has b
    1. f throws b –> m catches b
    2. f throws b –> x catches b
  • x has b
    1. x throws b –> m catches b
    2. x throws b –> f catches b

Assuming we could drop the ball, and holding everything else constant.

  • m has b
    1. m throws b –> f catches b
    2. m throws b –> f drops b
      1. f picks up the dropped b
    3. m throws b –> x catches b
    4. m throws b –> x drops b
      1. x picks up the dropped b
  • f has b
    1. f throws b –> m catches b
    2. f throws b –> m drops b
      1. m picks up the dropped b
    3. f throws b –> x catches b
    4. f throws b –> x drops b
      1. x picks up the dropped b
  • x has b
    1. x throws b –> m catches b
    2. x throws b –> m drops b
      1. m picks up the dropped b
    3. x throws b –> f catches b
    4. x throws b –> f drops b
      1. f picks up the dropped b

Will that change the thread models? What if the throwing pattern is not static; that is, the boy can throw to the girl or to the third player, and so forth? 

In this example: Yes, an additional thread gets added, because each player is a thread that can catch or drop the ball.  Each player is a thread of its own, transferring the data ‘b’ among them; throwing ‘b’ is like locking the data before the transfer, and catching ‘b’ is like unlocking it.  After the ball is dropped (perhaps determined randomly), the player holding the ball has to pick it up, which is equivalent to handling the data when a certain condition is met, such as an account balance falling below 500.  The model changes with the additional player because each person now has a choice to make about which player should receive the ball next, a choice that is not present in the first model with only two threads.  If there exists a static toss like

  • f –> m –> x –> f

Then the model doesn’t change, because there is no choice now.