Adv DBs: Unsupervised and Supervised Learning

Unsupervised and Supervised Learning:

Supervised learning is a type of machine learning that takes a given set of data points, we need to choose a function that gives users a classification or a value.  So, eventually, you will get data points that no longer defines a classification or a value, thus the machine now has to solve for that function. There are two main types of supervised learning: Classification (has a finite set, i.e. based on person’s chromosomes in a database, their biological gender is either male or female) and Regression (represents real numbers in the real space or n-dimensional real space).  In regression, you can have a 2-dimensional real space, with training data that gives you a regression formula with a Pearson’s correlation number r, given a new data point, can the machine use the regression formula with correlation r to predict where that data point will fall on in the 2-dimensional real space (Mathematicalmonk’s channel, 2011a).

Unsupervised learning aims to uncover homogenous subpopulations in databases (Connolly & Begg, 2015). In Unsupervised learning you are given data points (values, documents, strings, etc.) in n-dimensional real space, the machine will look for patterns through either clustering, density estimation, dimensional reduction, etc.  For clustering, one could take the data points and placing them in bins with common properties, sometimes unknown to the end-user due to the vast size of the data within the database.  With density estimation, the machine is fed a set of probability density functions to fit the data and it begins to estimates the density of that data set.  Finally, for dimensional reduction, the machine will find some lower dimensional space in which the data can be represented (Mathematicalmonk’s channel, 2011b).  With the dimensional reduction, it can destroy the structure that can be seen in the higher-order dimensions.

Applications suited to each method

  • Supervised: defining data transformations (Kelvin to Celsius, meters per second to miles per hour, classifying a biological male or female given the number of chromosomes, etc.), predicting weather (given the initial & boundary conditions, plug them into formulas that predict what will happen in the next time step).
  • Unsupervised: forecasting stock markets (through patterns identified in text mining news articles, or sentiment analysis), reducing demographical database data to common features that can easily describe why a certain population will fit a result over another (dimensional reduction), cloud classification dynamical weather models (weather models that use stochastic approximations, Monte Carlo simulations, or probability densities to generate cloud properties per grid point), finally real-time automated conversation translators (either spoken or closed captions).

Most important issues related to each method

Unsupervised machine learning is at the bedrock of big data analysis.  We could use training data (a set of predefined data that is representative of the real data in all its n-dimensions) to fine-tune the most unsupervised machine learning efforts to reduce error rates (Barak & Modarres, 2015). What I like most about unsupervised machine learning is its clustering and dimensional reduction capabilities, because it can quickly show me what is important about my big data set, without huge amounts of coding and testing on my end.

References: