Adv Quant: Decision Trees

Decision Trees

Humans, when facing a decision, tend to seek out a path, solution, or option that appears closest to the goal (Brookshear & Brylow, 2014). Decision trees are helpful because they are predictive models (Ahlemeyer-Stubbe & Coleman, 2014). Thus, decision trees aid in data abstraction and in finding patterns in an intuitive way (Ahlemeyer-Stubbe & Coleman, 2014; Brookshear & Brylow, 2014; Connolly & Begg, 2014), and they aid the decision-making process by mapping out all the paths, solutions, or options available for the decision maker to choose from. Every decision is different and varies in complexity; therefore, there is no single, simple recipe for writing a well-thought-out decision tree (Sadalage & Fowler, 2012).

Ahlemeyer-Stubbe and Coleman (2014) stated that decision trees are a great way to identify possible variables for inclusion in statistical models that are mutually exclusive and collectively exhaustive, even if the relationship between the target and the inputs is weak. To help facilitate decision making, each node on a decision tree can have a question attached to it that needs to be asked, with leaves associated with each node that represent the differing answers (McNurlin, Sprague, & Bui, 2008). The variable with the strongest influence becomes the topmost branch of the decision tree (Ahlemeyer-Stubbe & Coleman, 2014). Chaudhuri, Lo, Loh, and Yang (1995) define regression decision trees as those where the target question/variable is continuous, real-valued, or logistic. Murthy (1998) confirms this definition for regression decision trees and adds that classification decision trees are those whose target question/variable is split into different, finite, and discrete classes.

Aiming to mirror the way the human brain works, classification decision trees can be created by using neural network algorithms, which contain a network of nodes that can have multiple inputs, outputs, and processes at each node (Ahlemeyer-Stubbe & Coleman, 2014; Connolly & Begg, 2014). Neural network algorithms contrast with typical decision trees, which usually have one input, one output, and one process per node (similar to Figure 1). Once a root question has been identified, the decision tree algorithm keeps recursively iterating through the data in an effort to answer the root question (Ahlemeyer-Stubbe & Coleman, 2014).

However, the larger the decision tree grows, the weaker the leaves get, because the model tends to overfit the data. Thus, thresholds can be applied to the decision tree modeling algorithm to prune back the unstable leaves (Ahlemeyer-Stubbe & Coleman, 2014). When looking for a decision tree algorithm to parse through data, it is therefore best to find one that has pruning capabilities.
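The sources do not give code; as a hedged illustration of recursive splitting and of applying a pruning threshold, the sketch below fits a classification tree with scikit-learn and prunes it back using cost-complexity pruning. The built-in Iris data set and the choice of threshold are assumptions for demonstration only, not the authors' method.

```python
# Minimal sketch of fitting and pruning a classification tree (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree keeps splitting recursively and tends to overfit.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate pruning thresholds (alphas) from cost-complexity pruning.
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Refit with a mid-range threshold so the weakest, most unstable leaves are pruned away;
# in practice the threshold would be chosen by cross-validation.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alphas[len(alphas) // 2])
pruned_tree.fit(X_train, y_train)

print("leaves before/after pruning:", full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
print("test accuracy before/after:", full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))
```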


Figure 1: A left-to-right decision tree on whether or not to take an umbrella, assuming the person is going to spend any amount of time outside during the day.

Advantages of a decision tree

According to Ahlemeyer-Stubbe and Coleman (2014), some of the advantages of using decision trees are:

+ Few assumptions are needed about the distribution of the data

+ Few assumptions are needed about the linearity

+ Decision trees are not sensitive to outliers

+ Decision trees are best for large data, because of their adaptability and minimal assumptions needed to begin parsing the data

+ For logistic and linear regression trees, parameter estimation and hypothesis testing are possible

+ For neural network (Classification) decision trees, predictive equations can be derived

According to Murthy (1998), the advantages of using classification decision trees are:

+ Pre-classified examples mitigate the need for subject matter expert knowledge

+ It is an exploratory method as opposed to an inferential method

According to Chaudhuri et al. (1995), the advantages of using a regression decision tree are:

+ It can handle model complexity in an easily interpretable way

+ Covariate values are conveyed by the tree structure

+ Statistical properties can be derived and studied

References

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online].
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, 12th Edition. [VitalSource Bookshelf Online].
  • Chaudhuri, P., Lo, W. D., Loh, W. Y., & Yang, C. C. (1995). Generalized regression trees. Statistica Sinica, 641-666. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.133.4786&rep=rep1&type=pdf
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf Online].
  • McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management, 8th Edition. [VitalSource Bookshelf Online].
  • Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4), 345-389. Retrieved from http://cmapspublic3.ihmc.us/rid=1MVPFT7ZQ-15Z1DTZ-14TG/Murthy%201998%20DMKD%20Automatic%20Construction%20of%20Decision%20Trees.pdf
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [VitalSource Bookshelf Online].

Business Intelligence: Predictions Followup

  • Potential Opportunities:

o    Health monitoring.  Currently, smart watches track our heart rate, steps, standing time, stairs climbed, sitting time, workouts, biking, sleep, etc.  But what if we had a device that measured the chemicals in our blood daily, in a way that is no longer as painful as pricking your finger if you are diabetic?  Such technology could not only measure your blood's chemical makeup but could also send alerts to EMTs and doctors if there is a dangerous imbalance of chemicals in your blood (Carter et al., 2014).  This would require a strong BI program across emergency responders, individuals, and doctors.

o    As Moore's law of computational speed marches forward, companies have more chances to interpret real-time data and produce leading information that can drive actionable, data-driven decisions. Companies can finally get answers to strategic business questions in minutes as well (Carter et al., 2014).

o    Both internal data (corporate data) and external data (competitor analysis, customer analysis, social media, affinity and sentiment analysis) will be reported on a frequent basis to senior leaders and executives who have the authority to make decisions on behalf of the company.  These results may show up in a dashboard with a set number of indicators/metrics, as successfully implemented in a hospital case study (Topaloglou & Barone, 2015).

  • Potential Pitfalls:

o    Tools for threat detection, like those being piloted in New York City, could have an increased level of discrimination (Carter, Farmer, & Siegel, 2014). As big data analytics is used to perform facial recognition on photographs and live video to identify threats, it can lead to more racial profiling if the a priori knowledge fed into the system already has elements of racial profiling.  This could lead to biased reporting that tracks higher levels of a particular demographic, and past performance does not indicate the future.

o    Data must be validated before it is published into a data warehouse.  Due to the low-volatility feature of data warehouses, we need to ensure that the data we receive is correct; thus, expected-value thresholds must be set to capture errors before they are entered.  Wrong data in means wrong data analysis and wrong data-driven decisions.  An example of an expected-value threshold could be that Earth's temperature cannot exceed 500 K at the surface (a minimal sketch of this check follows this list of pitfalls).

o    Amplified customer experience.  As BI incorporates social media to gauge what is going on in the minds of customers, anything that goes viral and could hurt the company can be devastating.  Essentially, we are giving the customer an amplified voice.  This can range from software and hardware leak rumors, as happens with every Apple iPhone generation/release (which can put proprietary information into the hands of competitors), to a nasty comment or post that gets out of control on a social media platform, to celebrity boycotts.  The opportunity here, though, lies in receiving key information on how to improve products, identifying leakers of information, and settling nasty rumors, issues, or comments.
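As a hedged sketch of the expected-value-threshold check described in the data-validation pitfall above, the snippet below rejects rows that fall outside plausible ranges before they ever reach the warehouse; the column names and limits are invented for illustration.

```python
# Sketch of expected-value thresholds applied before rows are loaded into a warehouse.
# Column names and limits are illustrative assumptions, not a real schema.
EXPECTED_RANGES = {
    "surface_temperature_k": (150.0, 500.0),  # e.g., Earth's surface temperature cannot exceed 500 K
    "wind_speed_mps": (0.0, 120.0),
}

def threshold_violations(row: dict) -> list:
    """Return a list of violations; an empty list means the row may be loaded."""
    violations = []
    for column, (low, high) in EXPECTED_RANGES.items():
        value = row.get(column)
        if value is None or not (low <= value <= high):
            violations.append(f"{column}={value!r} outside expected range [{low}, {high}]")
    return violations

incoming = [
    {"surface_temperature_k": 288.2, "wind_speed_mps": 5.1},
    {"surface_temperature_k": 512.0, "wind_speed_mps": 3.0},  # caught before it is loaded
]
loadable = [row for row in incoming if not threshold_violations(row)]
rejected = [(row, threshold_violations(row)) for row in incoming if threshold_violations(row)]
print(len(loadable), "rows loadable;", len(rejected), "rows rejected")
```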

  • Potential Threats:

o    Loss of data through hackers who are aiming to steal someone's identity.  Firewalls must be tighter than ever, and networks must be more secure than ever, as a company moves to a centralized data warehouse.  Data warehouses are vital for BI initiatives, but if HR data is located in the warehouse (for example, to help HR calculate likelihood measures of disgruntled employees to aid retention efforts) and a hacker were to get hold of that data, thousands of people's information could be compromised.  This is nothing new, but it is a potential threat that must be mitigated as we proceed into BI systems.  This applies not only to people data but also to company proprietary data.

o    Consumer advertisement blitz. Companies may use BI to blast their customers with ads in hopes of marketing to people better, and use item affinity analysis to send coupons and attract more sales and higher revenue.  There is a personal example here for me:  XYZ is a clothing store, and when I moved into my first house, the previous owner never updated their information in the store's database.  Since the previous owner was a frequent buyer and those magazines, coupons, flyers, and sales were working on them, the address kept getting blasted with marketing ads.  When I moved in, I got a magazine every two days.  It was a waste of paper and made me less likely to shop there.  Eventually, I had enough and called customer service.  They resolved the issue, but it took six weeks after that call for my address to be removed from their marketing and customer database.  I haven't shopped there since.

o    Informational overload.  As companies move forward with implementing BI systems, they must meet with the entire multi-level organization to find out their data needs.  Just because we have the data doesn't mean we should display it.  The goal is to find the right number of key success factors, key performance indicators, and metrics to help the decision makers at all the different levels.  Overcomplicating this part can compromise the adoption of BI in the organization, and the system will be seen as a waste of money rather than a tool that could help in today's competitive market.  This is a hard line to walk, and it is one of the biggest threats.  It was recognized in the hospital case study (Topaloglou & Barone, 2015) and therefore mitigated through extensive planning, buy-in, and documentation.

 


Business Intelligence: OLAP

Within a Business Intelligence (BI) program, online analytical processing (OLAP) and customer relationship management (CRM) are both applications that have strategic uses for the company; both depend on the data warehouse to analyze the multidimensional datasets stored in it and provide data-driven answers to queries. They are both systems that require data analytics to turn all that multidimensional data into insightful information. OLAP's multidimensional view of the data warehouse data is possible because the data is mapped onto n-dimensional data cubes, where it can then easily be rolled up, drilled down, sliced and diced, and pivoted (Connolly & Begg, 2014). OLAP can have many applications outside of customer relationships.  Thus, OLAP is more versatile than CRM, because CRM is more targeted/focused in its approach: the analysis of the customer's relationship to the company/product.  CRM's main goal is to analyze internal and external data stored in the data warehouse to come up with insights such as a customer's "predicted affinity to buy," the "cost or profit" of a customer, a "prediction of future customer behavior," etc. (Ahlemeyer-Stubbe & Coleman, 2014).  The information gained from the CRM can inform employees about a customer's affinity toward a product so they can cross-sell similar items or items identified through a market basket analysis.

OLAP is an online analytical processing application that allows people to examine data in real time from different points of view to drive more data-driven decisions (McNurlin et al., 2008).  With OLAP, computers can perform what-if analyses and goal-based decisions using data. The key ability of OLAP systems is to help answer the "Why?" question, as well as the typical "Who?" and "What?" questions (Connolly & Begg, 2014).  Connolly and Begg (2014) further explain that OLAP is a specialized implementation of SQL. Unfortunately, the data queried is assumed to be static and unchanging; hence, the low-volatility aspect of a data warehouse with multidimensional databases is ideal for OLAP applications.  The value of the data warehouse does not come just from storing the right kind of data, but from conducting analyses that solve queries, which in the end help make the data-driven decisions that are best for the company.  According to Connolly and Begg (2014), OLAP tools have been used to study the effectiveness of marketing campaigns, product sales forecasting, and capacity planning.  However, it is the opinion of Connolly and Begg (2014) that data mining tools can surpass the capabilities of OLAP tools.
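To make the cube operations above concrete, here is a small illustrative sketch (not a real OLAP tool) of roll-up, drill-down, slice, and pivot on an invented sales table using pandas; the dimensions, measure, and values are assumptions for demonstration.

```python
import pandas as pd

# Invented multidimensional sales data: dimensions (region, product, quarter) and a measure (sales).
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q1", "Q2", "Q2"],
    "sales":   [100, 150, 90, 120, 110, 130],
})

# Roll-up: aggregate the cube from (region, product, quarter) up to region totals.
rollup = sales.groupby("region")["sales"].sum()

# Drill-down: break a region total back out by product and quarter.
drilldown = sales.groupby(["region", "product", "quarter"])["sales"].sum()

# Slice: fix one dimension (quarter = Q1) and look at the remaining sub-cube.
q1_slice = sales[sales["quarter"] == "Q1"]

# Pivot: rotate the cube so products become columns and regions become rows.
pivot = sales.pivot_table(values="sales", index="region", columns="product", aggfunc="sum")
```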

CRM, on the other hand, focuses on a wide range of concepts revolving around how companies store, capture, and analyze customer, vendor, and partner relationship data. Information stored in a CRM could be interactions with customers, vendors, or partners, which allows the company to gain insights based on previous interactions and could even be grouped/associated into different customer segments, market basket analyses, etc. (Ahlemeyer-Stubbe & Coleman, 2014). CRMs can assist in real time with making data-driven decisions with respect to a company's customers (McNurlin, Sprague, & Bui, 2008).  The goal is to use the current data to help the company build more optimal communications and relationships with its customers, vendors, or partners.  Both internal and external company data are usually added to the data warehouse for the CRM. Through the use of the internet, companies can learn more about their customers and their noncustomers, helping the company become more customer-centric (McNurlin et al., 2008).  McNurlin et al. (2008) described a case study in which Wachovia Bank purchased a pay-by-use CRM system from salesforce.com.  After the system was set up within six weeks, sales reps had 30 more hours to spend on selling more bank services, and managers could use the data collected by the CRM to tell the sales reps which customers would have the highest yield.


Business Intelligence: Corporate Planning

Corporate Planning

The main difference between business planning and corporate planning is the actors.  Both define strategies that will help meet the business goals and objectives.  However, business planning describes how the business will do it, focusing on business operations, marketing, and products and services (Smith, n.d.).  Meanwhile, corporate planning describes how the employees will do it, focusing on staff responsibilities and procedures (Smith, n.d.).  Smith (n.d.) implied that corporate planning will succeed if it is aligned with the company's strategy and mission, drawing on the company's strengths and improving on its weaknesses. A successful and realistic corporate and business plan can help the company succeed, and Business Intelligence can help in creating these plans.  In order to make the right plans, we must make better, data-driven decisions that help the company (through Business Intelligence).  Business Intelligence will help provide answers to questions much faster and more easily, make better use of corporate time, and aid in making improvements for the future (Carter, Farmer, & Siegel, 2014).

A small, medium, or large organization deals with planning differently, so BI solutions are not one-size-fits-all.  Small companies have the freedom, creativity, motivation, and flexibility that large companies lack (McNurlin, Sprague, & Bui, 2008).  Large companies have the economies of scale and knowledge that small companies do not (McNurlin et al., 2008).  Large companies are beginning to advocate centralized corporate planning with decentralized execution, which is a structure similar to that of a medium-sized company (McNurlin et al., 2008).  Thus, medium-sized companies have the benefits of both large and small companies, but also the disadvantages of both.  Unfortunately, a huge drawback of large organizations is a fear of collaboration and a tendency to hold tightly onto their proprietary information (Carter et al., 2014). Holding tightly to proprietary information and a lack of collaboration are not conducive to a solid Knowledge Management or Business Intelligence plan.

Business Intelligence

Business Intelligence uses data to create information that helps with data-driven decisions, which can be especially important for corporate planning.  Thus, we can reap the benefits of Business Intelligence to make data-driven decisions if we balance the needs of the company, the corporate vision, and the size of the company when choosing which model the company should use.  A centralized model is one where a single team in the entire corporation owns all the data and provides all the needed analytical services (Minelli, Chambers, & Dhiraj, 2013).  A decentralized model of Business Intelligence is one where each business function owns its data infrastructure and a team of data scientists (Minelli et al., 2013).  Finally, Minelli et al. (2013) defined a federated model as one where each function is allowed to access the data to make data-driven decisions, while also ensuring that it is aligned to a centralized data infrastructure.

Knowledge Management

McNurlin et al. (2008) define knowledge management as managing the transition between two states of knowledge: tacit knowledge (information that is privately kept in one's mind) and explicit knowledge (information that is made public, which is articulated and codified). We need to discover the key people who have the key knowledge, which will aid in knowledge sharing that benefits the company.  Knowledge management can rely on technology to be captured and shared appropriately such that it can be used to sustain the individual and sustain business performance (McNurlin et al., 2008).

Knowledge management can also include domain knowledge (knowledge of a particular field or subject).  The inclusion of domain knowledge in data mining, which is a component of a Business Intelligence system, has aided in pruning association rules to help extract meaningful data and develop data-driven decisions (Cristina, Garcia, Ferraz, & Vivacqua, 2009).  In that particular study, engineers helped build a domain understanding to interpret the results as well as steer the search for specific if-then rules, which helped to find more significant patterns in the data (Cristina et al., 2009).
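Below is a toy sketch of that idea, with invented maintenance transactions and an invented "actionable" constraint standing in for the engineers' domain knowledge; it is not the procedure used by Cristina et al. (2009), only an illustration of pruning if-then association rules with domain input.

```python
from itertools import combinations

# Invented maintenance transactions (sets of co-occurring events per work order).
transactions = [
    {"pump_fault", "valve_leak", "seal_wear"},
    {"pump_fault", "valve_leak", "sensor_drift"},
    {"valve_leak", "sensor_drift", "seal_wear"},
    {"pump_fault", "sensor_drift", "seal_wear"},
    {"pump_fault", "valve_leak", "sensor_drift"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Generate simple one-to-one if-then rules above minimum support and confidence.
items = sorted(set().union(*transactions))
rules = []
for a, b in combinations(items, 2):
    for antecedent, consequent in ((a, b), (b, a)):
        sup = support({antecedent, consequent})
        conf = sup / support({antecedent})
        if sup >= 0.4 and conf >= 0.6:
            rules.append((antecedent, consequent, round(sup, 2), round(conf, 2)))

# Domain-knowledge pruning: an engineer flags which consequents are actionable,
# so only meaningful rules are kept for decision making (illustrative constraint).
ACTIONABLE = {"valve_leak", "sensor_drift"}
pruned = [rule for rule in rules if rule[1] in ACTIONABLE]
print(pruned)
```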

The addition of domain experts helped capture tacit knowledge and transform it into explicit knowledge, which was then used to find significant patterns in the data that was collected and mined.  This eventually led to a more manageable set of information with high significance to the company, from which data-driven decisions can be made to support corporate planning. Thus, knowledge management can be an integral part of Business Intelligence.  Finally, Business Intelligence uses data to create information that, when combined with the experience of the employees (through knowledge management), can create explicit knowledge, which can support more meaningful data-driven decisions than focusing on a Business Intelligence system alone.

The effectiveness of capturing and adding domain knowledge into a company's Business Intelligence system depends on the quality of the employees in the company and their willingness to share that knowledge.  At the end of the day, a corporate plan that focuses on staff responsibilities and procedures around both Business Intelligence and Knowledge Management will gain more insights and a higher return on investment, which will eventually feed back into the corporate and business plans.

References

  • Carter, K. B., Farmer, D., & Siegel, C. (2014). Actionable Intelligence: A Guide to Delivering Business Results with Big Data Fast! John Wiley & Sons P&T. VitalBook file.
  • Cristina, A., Garcia, B., Ferraz, I., & Vivacqua, A. S. (2009). From data to knowledge mining. http://doi.org/10.1017/S089006040900016X
  • McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management, 8th Edition. Pearson Learning Solutions. VitalBook file.
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses. John Wiley & Sons P&T. VitalBook file.
  • Smith, C. (n.d.). The difference between business planning and corporate planning. Small Business Chron. Retrieved from http://smallbusiness.chron.com/differences-between-business-planning-corporate-planning-882.html

Business Intelligence: Data Warehouse

A data warehouse is a central database that contains a collection of decision-related internal and external sources of data for analysis, which is used by the entire company (Ahlemeyer-Stubbe & Coleman, 2014). The authors state that there are four main features of data warehouse content:

  • Topic Orientation – data that affects the decisions of a company (e.g., customers, products, payments, ads, etc.)
  • Logical Integration – the integration of the company's common data structures with relevant unstructured big data (e.g., social media data, social networks, log files, etc.)
  • Presence of Reference Period – time is an important structural component of the data because there is a need for historical data, which should be maintained for a long time
  • Low Volatility – data shouldn't change once it is stored. However, amendments are still possible; therefore, data shouldn't be overwritten, because keeping the prior versions gives us additional information about our data (a minimal sketch of this idea follows this list)
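A minimal sketch of the low-volatility point above, assuming an invented record layout: amendments are appended as new record versions rather than overwriting the old ones, so the history itself becomes additional information about the data.

```python
from datetime import date

# Illustrative append-only customer record: amendments are added as new versions
# instead of overwriting, so earlier states remain queryable (layout is invented).
warehouse = [
    {"customer_id": 42, "city": "Austin", "valid_from": date(2016, 1, 1), "valid_to": None},
]

def amend(records, customer_id, new_city, effective):
    """Close the current version and append the amendment; nothing is deleted."""
    for rec in records:
        if rec["customer_id"] == customer_id and rec["valid_to"] is None:
            rec["valid_to"] = effective  # close out the old version, keep the history
    records.append({"customer_id": customer_id, "city": new_city,
                    "valid_from": effective, "valid_to": None})

amend(warehouse, 42, "Denver", date(2017, 6, 1))
# Both versions are now present: the move itself is extra information about the customer.
print(warehouse)
```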

Given the type of data stored in a data warehouse, it is designed to help support data-driven decisions.  Making decisions from just a gut feeling can cost millions of dollars and degrade your service.  For continuous service improvements, decisions must be driven by data.  Your non-profit can use this data warehouse to drive priorities and to improve services that would yield short-term as well as long-term wins.  The question you need to be asking is, "How should we be liberating key data from the esoteric systems and allowing it to help us?"

To do that, you need to build a BI program: one where key stakeholders at each of the business levels agree on the logical integration of data and on common data structures, and are transparent about the metrics they would like to see, who will support the data, and so on.  We are looking for key stakeholders at the business level, process level, and data level (Topaloglou & Barone, 2015).  The reason is that we need to truly understand the business and its needs; from there we can understand the current data you have and the data you will need to start collecting.  Once the data is collected, we will prepare it before we enter it into the data warehouse, to ensure low volatility in the data, so that data modeling can be conducted reliably to enable your evaluation and data-driven decisions on how best to move forward (Padhy, Mishra, & Panigrahi, 2012).

Another non-profit service organization that implemented a successful BI program through the creation of a data warehouse is described by Topaloglou and Barone (2015).  This hospital experienced positive effects from implementing its BI program:  end users can make strategic data-based decisions and act on them, a shift in attitudes towards the use and usefulness of information, a change in the perception of data scientists from developers to problem solvers, data prompting immediate action, continuous improvement as a byproduct of the BI system, real-time views with drill-down into data details enabling more data-driven decisions and actions, the development of meaningful dashboards that support business queries, etc. (Topaloglou & Barone, 2015).

However, Topaloglou and Barone (2015) stressed multiple times in the study that establishing a common data structure and definition, with defined stakeholders and people accountable for supporting the company's goal based on how the current processes are doing, is key to realizing these benefits.  That key lives in the data warehouse, your centralized location of external and internal data, which will give you the insights to make data-driven decisions that support your company's goal.


Business Intelligence: Multilevel BI

Annotated Bibliography

Citation:

Curry, E., Hasan, S., & O’Riain, S. (2012, October). Enterprise energy management using a linked dataspace for energy intelligence. In Sustainable Internet and ICT for Sustainability (SustainIT), 2012 (pp. 1-6). IEEE.

Author’s Abstract:

“Energy Intelligence platforms can help organizations manage power consumption more efficiently by providing a functional view of the entire organization so that the energy consumption of business activities can be understood, changed, and reinvented to better support sustainable practices. Significant technical challenges exist in terms of information management, cross-domain data integration, leveraging real-time data, and assisting users to interpret the information to optimize energy usage. This paper presents an architectural approach to overcome these challenges using a Dataspace, Linked Data, and Complex Event Processing. The paper describes the fundamentals of the approach and demonstrates it within an Enterprise Energy Observatory.”

 

My Personal Summary:

Using BI as a foundation, a linked (key data is connected to provide information and knowledge) dataspace (a huge data mart with data that is related to each other as needed) for energy intelligence was implemented for the Digital Enterprise Research Institute (DERI), which has roughly 130 staff located in one building.  The program was trying to measure the direct (electricity costs for data centers, lights, monitors, etc.) and indirect (cost of fuel burned, cost of gas used by commuting staff) energy usage of the enterprise so that it could become a more sustainable company (as climate change is a big topic these days).  The paper showed that a multi-level and holistic view of business intelligence (on energy usage) was needed, and it discussed the types of information conveyed at each level.

My Personal Assessment:

However, this paper didn't discuss how effective the implementation of this system was.  What would have improved this paper is saying something about the decrease in CO2 emissions DERI achieved over the past year.  The authors could have graphed a time series chart showing power consumption before and after implementation of this multi-level BI system.  The paper was objective and didn't have any slant as to why we should implement a similar system.  They state that their future work is to provide more granularity in their levels, but say nothing about what business value the system has had for the company.  Thus, with no figures stating the value of this system, the paper reads more like a conceptual, how-to manual.

My Personal Reflection:

This paper doesn't fit well into my research topic, but it was helpful in defining a dataspace and a multi-level, holistic BI system.  I may use the conceptual methodology of a dataspace in my own methodology, where I collect secondary data from the National Hurricane Center into a big data warehouse and link the data as it becomes relevant.  This should save me time and reduce the labor-intensive costs of data integration by postponing it until it is required.  It has changed my appreciation of data science, as it offers another philosophy beyond bringing in one data set at a time into a data warehouse and making all your connections before moving on to the next data set.

A multilevel business intelligence setup and how it affects the framework of an organization’s decision-making processes. 

In Curry et al. (2012), the authors applied a linked dataspace BI system to a holistic, multi-level organization.  Holistic aspects of their BI system included enterprise resource planning, finance, facility management, human resources, asset management, and code compliance.  From a holistic standpoint, most of these groups had siloed information that made it difficult to leverage data across their domains.  This, however, is different from the multi-level BI system setup.  As defined in Table II of Curry et al. (2012), in the multi-level setup the data gets shown at the organizational level (stakeholders are executive members, shareholders, regulators, suppliers, consumers), the functional level (stakeholders are functional managers and the organization manager), and the individual level (stakeholders are the employees).  Each of these stakeholders has different information requirements and different levels of access to certain types of data, and the multi-level BI system must take this into account.  Different information requirements and access mean different energy metrics: organizational-level metrics could be total energy consumption or percent renewable energy sources, whereas individual-level metrics could be business travel, individual IT consumption, laptop electricity consumption, etc.  It wouldn't make sense for an executive or a shareholder to look at every one of the 130 staff members' laptop electricity consumption metrics when they could get a company-wide figure.   However, the authors did note that organizational-level data can be further drilled down to find the cause of a particular event in question.  Certain data that the executives can see will not be accessible to all individual employees; a multi-level BI system also addresses this.  Likewise, employee A cannot view employee B's energy consumption, because lateral views of the BI system data may not be permissible.
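The sketch below is a toy illustration of that multi-level idea, with invented readings and access rules rather than the actual architecture from Curry et al. (2012): the same underlying readings roll up into different metrics per level, and lateral access at the individual level is blocked.

```python
# Toy sketch of multi-level energy metrics with simple access rules (invented, not from Curry et al., 2012).
readings = [  # kWh per employee per category
    {"employee": "alice", "department": "lab",   "category": "laptop", "kwh": 3.2},
    {"employee": "alice", "department": "lab",   "category": "travel", "kwh": 9.0},
    {"employee": "bob",   "department": "admin", "category": "laptop", "kwh": 2.5},
]

def organizational_metric(data):
    """Company-wide roll-up, e.g., total energy consumption."""
    return sum(r["kwh"] for r in data)

def functional_metric(data, department):
    """Roll-up visible to a functional manager for one department."""
    return sum(r["kwh"] for r in data if r["department"] == department)

def individual_metric(data, viewer, employee):
    """Lateral views are not permitted: an employee may only see their own consumption."""
    if viewer != employee:
        raise PermissionError(f"{viewer} may not view {employee}'s consumption")
    return sum(r["kwh"] for r in data if r["employee"] == employee)

print(organizational_metric(readings))                # what an executive sees
print(functional_metric(readings, "lab"))             # what a functional manager sees
print(individual_metric(readings, "alice", "alice"))  # what an individual sees
```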

Each of the different levels of metrics reported by this multi-level BI system allows that particular level to make data-driven decisions to reduce its carbon footprint.  An executive can look at the organizational-level metrics and institute a power-down-your-monitors-at-night initiative to save power corporate-wide.  At the individual level, an employee could choose to leave for work earlier to avoid sitting in traffic and wasting gas, thus reducing their indirect carbon footprint for the company.  Managers can decide to request funding for energy-efficient monitors and laptops for their teams, or even a single power strip per person, to reduce their teams' energy consumption cost, based on the level of metrics they can view.