Adv DB: Indexes for query optimization

Information sought in a database can be extracted through a query.  However, the bigger the database, the longer a query takes to process, hence the need for query optimization techniques.  Complex query operations are another reason to optimize.

Why an index is rarely applied to every column in every table

Using indices for query optimization is like using the index at the back of a book to find the information or topic you need quickly.  You could always scan all the tables, just as you could read the entire book, but that is not efficient (Nevarez, 2010).  A predicate on an indexed column can be resolved with an index seek (ProductID = 77), whereas wrapping the column in a function (ABS(ProductID) = 77) forces an index scan, and a scan takes up more resources than a seek.  You can combine them (ProductID = 77 AND ABS(SalesOrderID) = 12345), where you would seek via ProductID and scan for SalesOrderID.  Indexing is an effective way to optimize a query, alongside other methods such as applying heuristic rules or ordering the query operations for efficient use of resources (Connolly & Begg, 2014).  However, indices that are never used are of no use to us: they take up space on the system (Nevarez, 2010), which can slow down operations.  Thus, they should be removed.  That is why indexing shouldn’t be applied to every column in every table.  Whether an index is worthwhile also depends on the size of the table: indexing is not needed if the table is 3 x 4, but may be needed if the table is 30,000 x 12.
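
The difference between a seek and a scan can be observed directly. The following is a minimal sketch, assuming SQLite (through Python's built-in sqlite3 module) rather than the SQL Server setup Nevarez describes; the sales table, its rows, and the index name idx_sales_product are made up for illustration. SQLite reports a seek as SEARCH and a scan as SCAN in its query plan.

```python
import sqlite3

# Hypothetical table and data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (SalesOrderID INTEGER, ProductID INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(order_id, order_id % 100) for order_id in range(1, 1001)],
)
conn.execute("CREATE INDEX idx_sales_product ON sales (ProductID)")


def plan(sql):
    """Return the 'detail' column of SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Plain comparison on the indexed column: the index can be used to seek.
seek_plan = plan("SELECT * FROM sales WHERE ProductID = 77")

# Wrapping the column in ABS() hides it from the index: every row is visited.
scan_plan = plan("SELECT * FROM sales WHERE ABS(ProductID) = 77")

print(seek_plan)
print(scan_plan)
```

The exact wording of the plan text differs between SQLite versions, so only the leading SEARCH or SCAN keyword should be relied on.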

Thoughts on how to best manage data files in a database management system (DBMS)

Never assume; verify any changes you make with cold, hard data.  When considering how best to manage a database, one must first learn whether the data files or the data within the database are dynamic (users create, insert, update, and delete regularly) or static (changes are minimal to nonexistent) (Connolly & Begg, 2014).  Database administrators need to know when to fine-tune their databases with useful indices on tables that are widely used and when to turn off those that are not used at all.  Turning off unused indices will save space, speed up update operations, and improve resource utilization (Nevarez, 2010).  Knowing how the data are used helps us understand the nature of the database user.  We can then rewrite queries so that they are optimized: order the operations correctly, replace unnecessary loops with joins, use join, right join, or left join properly, avoid the wildcard (*) and call only the data you need, and ensure proper use of internal temporary tables (those created on a server while querying).  Also, when timing queries, compare like with like: test a first run against another first run, so that a time that includes data already stored in the cache is not accidentally compared with one that does not.  That said, caching your results and using the cache in your system when processing queries is ideal.  A disadvantage of creating too many tables in the same database is slower interaction times, so creating multiple databases with fewer tables (as best logic permits) may be a great way to help with caching your results (MySQL 5.5 Manual, 2004).

Resources

Adv Database Management: SQL Unions

Please note that this blog post provides a summary view: each entry names what you need to get done, followed by a quick example that illustrates how to do it in SQL. For more information please see the resources below:

Union
SELECT ename, job, deptno
  FROM emp
  UNION
SELECT name, title, deptid
  FROM emp_history
Intersect
SELECT ename, job, deptno
  FROM emp
  INTERSECT
SELECT name, title, deptid
  FROM emp_history
Union all (will include duplicate values)
SELECT ename, job, deptno
  FROM emp
  UNION ALL
SELECT name, title, deptid
  FROM emp_history
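
To see how the three set operators differ on the same data, here is a small sketch that runs them through Python's built-in sqlite3 module. The rows are invented for illustration, and the column types are assumptions; only the table and column names come from the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (ename TEXT, job TEXT, deptno INTEGER);
    CREATE TABLE emp_history (name TEXT, title TEXT, deptid INTEGER);
    -- Hypothetical rows: SMITH appears in both tables.
    INSERT INTO emp VALUES ('SMITH', 'CLERK', 20), ('ALLEN', 'SALESMAN', 30);
    INSERT INTO emp_history VALUES ('SMITH', 'CLERK', 20), ('KING', 'PRESIDENT', 10);
""")


def run(set_operator):
    """Combine the two tables with the given set operator and sort the rows."""
    sql = f"""
        SELECT ename, job, deptno FROM emp
        {set_operator}
        SELECT name, title, deptid FROM emp_history
    """
    return sorted(conn.execute(sql).fetchall())


union_rows = run("UNION")          # duplicates removed
union_all_rows = run("UNION ALL")  # duplicates kept
intersect_rows = run("INTERSECT")  # only rows present in both tables

print(len(union_rows), len(union_all_rows), intersect_rows)
```

Both SELECT lists must have the same number of columns in the same order, since the operators compare rows position by position.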

Adv Database Management: SQL Sub-queries and views

Subquery
SELECT ename
  FROM emp
  WHERE sal >
    (SELECT sal
     FROM emp
     WHERE empno = 7566)
Correlated Subqueries
SELECT empno, sal, deptno
  FROM   emp outr
  WHERE  sal >
    (SELECT AVG(sal)
     FROM   emp innr
     WHERE  outr.deptno = innr.deptno)
Exists
SELECT empno, ename, job, deptno
  FROM   emp outr
  WHERE  EXISTS
    (SELECT empno
     FROM   emp innr
     WHERE  innr.mgr = outr.empno)
Not Exists
SELECT dname, deptno
  FROM   dept d
  WHERE  NOT EXISTS
    (SELECT *
     FROM   emp e
     WHERE  d.deptno = e.deptno)
In
SELECT empno, ename, job, deptno
  FROM   emp outr
  WHERE empno IN
    (SELECT mgr
     FROM   emp)
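
The EXISTS and IN examples above both ask "which employees manage someone?". The sketch below, which assumes SQLite via Python's sqlite3 module and four invented rows, checks that the two forms agree. It also shows a common pitfall worth knowing: because mgr contains a NULL, the NOT IN form returns no rows at all, which is one reason NOT EXISTS is often the safer choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (empno INTEGER, ename TEXT, mgr INTEGER);
    -- Hypothetical rows: KING has no manager, so mgr is NULL.
    INSERT INTO emp VALUES
        (7839, 'KING',  NULL),
        (7566, 'JONES', 7839),
        (7788, 'SCOTT', 7566),
        (7876, 'ADAMS', 7788);
""")

# Employees for whom at least one row names them as manager.
exists_rows = conn.execute("""
    SELECT ename FROM emp outr
    WHERE EXISTS (SELECT 1 FROM emp innr WHERE innr.mgr = outr.empno)
    ORDER BY ename
""").fetchall()

# Same question phrased with IN.
in_rows = conn.execute("""
    SELECT ename FROM emp
    WHERE empno IN (SELECT mgr FROM emp)
    ORDER BY ename
""").fetchall()

# NOT IN against a list containing NULL evaluates to unknown, never true.
not_in_rows = conn.execute("""
    SELECT ename FROM emp
    WHERE empno NOT IN (SELECT mgr FROM emp)
""").fetchall()

print(exists_rows, in_rows, not_in_rows)
```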

Creating a view
CREATE VIEW empvu10
  AS SELECT empno, ename, job
     FROM emp
     WHERE deptno = 10
Drop view
DROP VIEW empvu10

Adv Database Management: SQL Group functions

AVG, COUNT, MAX, MIN, STDDEV, SUM, VARIANCE
SELECT AVG(sal), MAX(sal), MIN(sal), SUM(sal)
  FROM emp
  WHERE job LIKE 'Sales%'
COUNT
SELECT COUNT(*)
  FROM emp
  WHERE deptno = 30
Group By
SELECT deptno, AVG(sal)
  FROM emp
  GROUP BY deptno
Rollup and cube
SELECT   deptno, MAX(sal)
  FROM     emp
  GROUP BY deptno WITH ROLLUP   -- or WITH CUBE
Having
SELECT   deptno, MAX(sal)
  FROM     emp
  GROUP BY deptno
  HAVING MAX(sal) > 2900
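
As a quick check of how HAVING differs from a plain GROUP BY, the sketch below runs both queries on six invented rows, assuming SQLite via Python's sqlite3 module. HAVING filters whole groups after aggregation, whereas WHERE filters individual rows before it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (ename TEXT, deptno INTEGER, sal REAL);
    -- Hypothetical salaries, two employees per department.
    INSERT INTO emp VALUES
        ('CLARK', 10, 2450), ('KING',  10, 5000),
        ('SMITH', 20,  800), ('SCOTT', 20, 3000),
        ('JAMES', 30,  950), ('BLAKE', 30, 2850);
""")

# One row per department with its highest salary.
grouped = conn.execute("""
    SELECT deptno, MAX(sal) FROM emp
    GROUP BY deptno
    ORDER BY deptno
""").fetchall()

# Same groups, but only those whose highest salary exceeds 2900.
filtered = conn.execute("""
    SELECT deptno, MAX(sal) FROM emp
    GROUP BY deptno
    HAVING MAX(sal) > 2900
    ORDER BY deptno
""").fetchall()

print(grouped)
print(filtered)
```

Department 30 drops out of the second result because its highest salary, 2850, does not pass the HAVING condition.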

Database Management: SQL Joins

Equijoins
SELECT e.ename, e.deptno,  d.deptno, d.dname
  FROM emp e INNER JOIN dept d
  ON e.deptno = d.deptno
Non-Equijoins
SELECT e.ename, e.sal,  s.grade
  FROM emp e INNER JOIN salgrade s
  ON e.sal
  BETWEEN  s.losal  AND  s.hisal

From:
grade      losal        hisal
-----      -----        ------
1            700        1200
2           1201        1400
3           1401        2000
4           2001        3000
5           3001        9999

Gives the following solution:
ename           sal     grade
----------   --------- ---------
JAMES            950         1
SMITH            800         1
ADAMS           1100         1
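
The non-equijoin can be reproduced with the salgrade table shown above. This is a sketch assuming SQLite via Python's sqlite3 module; the three grade-1 employees come from the result above, while the fourth employee (WARD, 1250) is an added, hypothetical row that shows a salary landing in grade 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salgrade (grade INTEGER, losal INTEGER, hisal INTEGER);
    INSERT INTO salgrade VALUES
        (1,  700, 1200), (2, 1201, 1400), (3, 1401, 2000),
        (4, 2001, 3000), (5, 3001, 9999);

    CREATE TABLE emp (ename TEXT, sal INTEGER);
    -- WARD is a hypothetical extra row for illustration.
    INSERT INTO emp VALUES
        ('JAMES', 950), ('SMITH', 800), ('ADAMS', 1100), ('WARD', 1250);
""")

# Each employee matches the one grade whose salary range contains sal.
rows = conn.execute("""
    SELECT e.ename, e.sal, s.grade
    FROM emp e INNER JOIN salgrade s
      ON e.sal BETWEEN s.losal AND s.hisal
    ORDER BY s.grade, e.ename
""").fetchall()

for row in rows:
    print(row)
```

Because the grade ranges do not overlap, every salary falls into exactly one grade and no employee is duplicated in the result.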
Outer joins
SELECT e.ename, e.deptno,  d.deptno
  FROM emp e RIGHT JOIN dept d
  ON e.deptno = d.deptno

SELECT e.deptno,  d.deptno, d.dname
  FROM emp e LEFT JOIN dept d
  ON e.deptno = d.deptno
Self Joins
SELECT worker.ename + ' works for ' + manager.ename
  FROM emp worker INNER JOIN emp manager
  ON worker.mgr = manager.empno
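
A self join can be tried out on a tiny table. The sketch below assumes SQLite via Python's sqlite3 module, where strings are concatenated with || rather than the + operator used above; the three rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (empno INTEGER, ename TEXT, mgr INTEGER);
    -- Hypothetical reporting chain: SCOTT -> JONES -> KING.
    INSERT INTO emp VALUES
        (7839, 'KING',  NULL),
        (7566, 'JONES', 7839),
        (7788, 'SCOTT', 7566);
""")

# The same table is joined to itself under two aliases.
sentences = [
    row[0]
    for row in conn.execute("""
        SELECT worker.ename || ' works for ' || manager.ename
        FROM emp worker INNER JOIN emp manager
          ON worker.mgr = manager.empno
        ORDER BY worker.ename
    """)
]

for sentence in sentences:
    print(sentence)
```

KING does not appear as a worker because an inner join drops rows whose mgr is NULL; a LEFT JOIN would keep that row with an empty manager.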

Database Management: SQL Select

Select
SELECT *
  FROM dept
Parentheses
SELECT ename, sal, 12*(sal+100)
  FROM emp
Column Alias
SELECT ename AS name, sal salary
  FROM emp
Eliminating Duplicate Rows
SELECT DISTINCT deptno
  FROM emp
Selecting with WHERE
SELECT ename, job, deptno
  FROM emp
  WHERE job='Clerk'
Comparison Operators in WHERE
SELECT ename, sal, comm
  FROM emp
  WHERE sal<=comm
Between Operators in WHERE
SELECT ename, sal, comm
  FROM emp
  WHERE sal BETWEEN 1000 AND 1500
Test Values in a list
SELECT ename, empno, mgr
  FROM emp
  WHERE mgr IN (7902, 7566, 7788)
Pattern Matching in WHERE

_ matches exactly one character

% matches any run of zero or more characters

* is not a LIKE wildcard in standard SQL; in a SELECT list it means all columns

SELECT ename
  FROM emp
  WHERE ename LIKE '_A%'
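
The two LIKE wildcards can be checked against a handful of names. This is a minimal sketch assuming SQLite via Python's sqlite3 module, with five invented rows; the helper name matching is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (ename TEXT);
    -- Hypothetical names for illustration.
    INSERT INTO emp VALUES ('WARD'), ('MARTIN'), ('JAMES'), ('ADAMS'), ('SMITH');
""")


def matching(pattern):
    """Return the names that match a LIKE pattern, in alphabetical order."""
    sql = "SELECT ename FROM emp WHERE ename LIKE ? ORDER BY ename"
    return [row[0] for row in conn.execute(sql, (pattern,))]


second_letter_a = matching("_A%")  # any one character, then A, then anything
starts_with_a = matching("A%")     # A followed by anything
four_letters = matching("____")    # exactly four characters

print(second_letter_a, starts_with_a, four_letters)
```

Passing the pattern as a bound parameter instead of pasting it into the SQL string keeps the quoting simple and avoids injection problems.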
Both conditions in WHERE are true
SELECT ename, empno, job, sal
  FROM emp
  WHERE sal>=1100
  AND job='Clerk'
Either condition in WHERE is true
SELECT ename, empno, job, sal
  FROM emp
  WHERE sal>=1100
  OR job='Clerk'
Not Operator in WHERE
SELECT ename, empno, job, sal
  FROM emp
  WHERE job NOT IN ('Clerk', 'Manager', 'Analyst')
Sort rows
SELECT ename, job, deptno, hiredate
  FROM emp
  ORDER BY hiredate
Descending order sorting of rows
SELECT ename, job, deptno, hiredate
  FROM emp
  ORDER BY hiredate DESC
Table aliases
SELECT e.empno, e.ename, e.deptno,  d.deptno, d.loc
  FROM emp e, dept d
  WHERE e.deptno = d.deptno

Database Management: SQL Basics

Create table
CREATE TABLE dept
  (deptno NUMERIC(2),
   dname VARCHAR(14),
   loc VARCHAR(14))
Add Column
ALTER TABLE dept
  ADD job VARCHAR (9);
Rename Table
RENAME dept TO department
Rename Column
ALTER TABLE dept ALTER COLUMN job RENAME TO career
Delete Table
DROP TABLE dept
Not Null in a table (forces that column to hold a value in every row)
CREATE TABLE dept
  (deptno NUMERIC(2) NOT NULL,
   dname VARCHAR(14),
   loc VARCHAR(14))
Unique (allows null values, but doesn’t allow the same value to be repeated in that column)
CREATE TABLE dept
  (deptno NUMERIC(2),
   dname VARCHAR(14),
   loc VARCHAR(14),
   CONSTRAINT dept_dname_uk UNIQUE (dname))
Primary Key (doesn’t allow null or duplicate values)
CREATE TABLE dept
  (deptno NUMERIC(2),
    dname VARCHAR(14),
    loc VARCHAR(14),
    CONSTRAINT dept_dname_uk UNIQUE (dname),
    CONSTRAINT dept_deptno_pk PRIMARY KEY(deptno))
Foreign Keys (connect data with other tables)
CREATE TABLE emp
  (empno NUMERIC(4),
   ename VARCHAR(10) NOT NULL,
   job VARCHAR(9),
   mgr NUMERIC(4),
   hiredate DATETIME,
   sal MONEY,
   comm MONEY,
   deptno NUMERIC(2) NOT NULL,
   CONSTRAINT emp_deptno_fk FOREIGN KEY (deptno) REFERENCES dept (deptno))
Check Constraints (sanity checks)
CREATE TABLE Dept
  (deptno NUMERIC(2),
    dname VARCHAR(14),
    loc VARCHAR(14),
    CONSTRAINT dept_dname_uk UNIQUE (dname),
    CONSTRAINT dept_deptno_pk PRIMARY KEY(deptno),
    CONSTRAINT dept_deptno_ck CHECK (deptno BETWEEN 10 AND 99))
Insert new rows of data
INSERT INTO dept (deptno, dname, loc)
  VALUES (50, 'development', 'Detroit')
Change data within the table
UPDATE emp
  SET deptno = 20
  WHERE empno = 7782
Removing a row from a table
DELETE FROM dept
  WHERE dname = 'development'
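Removing a data entry from all related tables (CAUTION; many DBMSs do not accept DELETE CASCADE and instead cascade a plain DELETE through foreign keys declared with ON DELETE CASCADE)
DELETE CASCADE FROM dept
  WHERE dname = 'development'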

NOTE:

  • Before running a DELETE, first run a SELECT with the same WHERE clause to confirm exactly which rows will be removed, so you are not shooting in the dark.
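
The preview-then-delete habit from the note above can be sketched as follows, assuming SQLite via Python's sqlite3 module. The rows, the variable names, and the rule "delete only if exactly one row matches" are illustrative assumptions, not a prescribed procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (deptno INTEGER, dname TEXT, loc TEXT);
    -- Hypothetical rows for illustration.
    INSERT INTO dept VALUES
        (10, 'accounting',  'New York'),
        (50, 'development', 'Detroit');
""")

# Use one WHERE clause for both statements so they cannot drift apart.
where_clause = "dname = ?"
params = ("development",)

# 1. Preview the rows that the DELETE would remove.
preview = conn.execute(
    f"SELECT deptno, dname, loc FROM dept WHERE {where_clause}", params
).fetchall()
print("About to delete:", preview)

# 2. Delete only when the preview matches the expectation (one row here).
if len(preview) == 1:
    deleted = conn.execute(
        f"DELETE FROM dept WHERE {where_clause}", params
    ).rowcount
else:
    deleted = 0

remaining = conn.execute("SELECT dname FROM dept").fetchall()
print(deleted, remaining)
```

Reusing the same WHERE text and parameters for the SELECT and the DELETE is the point of the sketch: the preview is only meaningful if both statements filter identically.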

Data Tools: Artificial Intelligence

Big data Analytics and Artificial Intelligence

Artificial Intelligence (AI) is an embedded technology, based on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015).  Previously, AI could not come into existence without the computational power that is available today (Cringely, 2013).  AI can make use of data hidden in “dark wells” and silos, where the end-user had no idea that the data even existed to begin with (Power, 2015).  The goal of AI is to use huge amounts of data to draw out a set of rules through machine learning that will effectively replace experts in a certain field (Cringely, 2013; Power, 2015).  Cringely (2013) stated that in some situations big data can eliminate the need for theory and that AI can aid in analyzing big data where theory is either lacking or impossible to define.

AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015).  What slowed the progression of AI in the past was the creation of human-readable computer languages like XML or SQL, which are not intuitive for computers to read (Cringely, 2013).  Fortunately, AI can easily use structured data and can now use unstructured data as well, thanks to everyone who tags these unstructured data either in comments or on the data point itself, which speeds up computation (Cringely, 2013; Power, 2015).  Dewey (2013) hypothesized that not only will AI be able to analyze big data faster than any human can, but that an AI system can also begin to improve its own search algorithms in a phenomenon called an intelligence explosion.  An intelligence explosion occurs when an AI system analyzes itself to improve itself in an iterative process, to the point where its improvement grows exponentially (Dewey, 2013).

Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “why” behind them, which makes the results hard to interpret (Power, 2015).  It would take many scientists analyzing all of the same big data to fully understand how the connections were made in the AI system, which is no longer feasible (Cringely, 2013).  It is as if data scientists are trying to read the mind of the AI system when they currently cannot read a human’s mind.  However, the results of AI are becoming accurate, with AI identifying cats in photographs after 72 hours of machine learning and after a cat is tagged in a few photographs (Cringely, 2013).  AI could be applied to any field of study, such as finance, social science, science, or engineering, or could even play against champions on the Jeopardy game show (Cyranoski, 2015; Cringely, 2013; Dewey, 2013; Power, 2015).

Example of artificial intelligence use in big data analysis: Genomics

The goal of using AI on genomic data is to help analyze physiological traits and lifestyle choices in order to provide a dedicated and personalized health plan to treat and eventually prevent disease (Cyranoski, 2015; Power, 2015).  This is done by feeding AI systems huge amounts of genomic data, which is considered big data by today’s standards (Cyranoski, 2015).  Systems like IBM’s Watson (an AI system) could provide treatment options based on the results gained from analyzing thousands or even millions of genomic data sets (Power, 2015).  This is done by analyzing all this data and allowing machine learning techniques to devise algorithms based on the input data (Cringely, 2013; Cyranoski, 2015; Power, 2015).  As of 2015, there are genomic data for about 100,000 individuals in the system, and even this huge amount of data is still not enough to provide the personalized health plan that is currently being envisioned based on a person’s genomic data (Cyranoski, 2015).  Eventually, millions of individuals will need to be added to the AI system, and not just their genomic data, but also their proteomics, metabolomics, lipidomics, etc.

Resources: