Adv DB: Indexes for query optimization

Information sought in a database can be extracted through a query.  However, the bigger the database, the longer a query takes to process, which is why query optimization techniques are applied.  Complex query operations are another reason to optimize.

Why an index is rarely applied to every column in every table

Using indices for query optimization is like using the index at the back of a book to find the information or topic you need quickly. You could always scan all the tables, just as you could read the entire book, but that is not efficient (Nevarez, 2010).  A predicate on the bare column (ProductID = 77) allows an index seek, whereas wrapping the column in a function (ABS(ProductID) = 77) forces an index scan, and a scan takes up more resources than a seek.  You can combine them (ProductID = 77 AND ABS(SalesOrderID) = 12345), where you would seek via ProductID and scan for SalesOrderID.  Indexing can be seen as an effective way to optimize your query, alongside other methods like applying heuristic rules or ordering the query operations for efficient use of resources (Connolly & Begg, 2014).  However, indices that are never used are of no benefit to us: they take up space on the system (Nevarez, 2010) and still have to be maintained on every insert, update, and delete, which can slow down your operations.  Thus, they should be removed.  That is why indexing shouldn't be applied to every column in every table.  Indexing every column may also be unnecessary because the benefit depends on the size of the table: an index is not needed for a table of 3 rows by 4 columns, but may be needed for a table of 30,000 rows by 12 columns.
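The seek-versus-scan distinction above can be observed directly by asking a query planner what it intends to do. The sketch below is only an illustration under stated assumptions: it uses SQLite (through Python's built-in sqlite3 module) as a stand-in for the SQL Server setting that Nevarez describes, and the table `sales`, its columns, and the index name `idx_sales_product` are hypothetical. SQLite reports a seek as SEARCH and a full pass over the table as SCAN.

```python
import sqlite3

# Hypothetical in-memory table, sized like the 30,000-row example in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sales_order_id INTEGER, product_id INTEGER, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(i, i % 500, 1) for i in range(30_000)],
)
conn.execute("CREATE INDEX idx_sales_product ON sales (product_id)")


def plan(sql):
    """Return the planner's human-readable description of each step."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the detail text


# Bare column in the predicate: the planner can seek through the index.
seek_plan = plan("SELECT * FROM sales WHERE product_id = 77")

# Column wrapped in a function: the index on product_id no longer applies.
scan_plan = plan("SELECT * FROM sales WHERE abs(product_id) = 77")

print(seek_plan)
print(scan_plan)
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look only at whether each line begins with SEARCH or SCAN.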

Thoughts on how to best manage data files in a database management system (DBMS)

Never assume; verify any change you make with cold, hard data. When considering how best to manage a database, one must first learn whether the data files, or the data within the database, are dynamic (users create, insert, update, and delete regularly) or static (changes are minimal to non-existent) (Connolly & Begg, 2014).  Database administrators need to know when to fine-tune their databases with useful indices on tables that are widely used and when to turn off indices that are not used at all.  Turning off unused indices saves space, speeds up update operations, and improves resource utilization (Nevarez, 2010). Knowing these usage patterns helps us understand the nature of the database user. We can then rewrite queries so that they are optimized: order the operations correctly, replace unnecessary loops with joins, use the appropriate join type (for example, right join or left join) properly, avoid the wildcard (*) and request only the data you need, and ensure proper use of internal temporary tables (those created on the server while querying).  Also, when timing queries, compare like with like: a first run reads from disk while later runs may be served from the cache, so test the first run against another first run and avoid accidentally including a timing that benefits from cached data. That said, caching your results, and using the cache in your system when processing queries, is ideal.  A disadvantage of creating too many tables in the same database is slower interaction times, so creating multiple databases with fewer tables (as far as logic permits) may be a good way to help with caching your results (MySQL 5.5 Manual, 2004).
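Two of the rewrites mentioned above, replacing a loop with a join and naming only the columns you need instead of using the wildcard, can be sketched as follows. This is a minimal illustration, not a benchmark: it uses SQLite through Python's sqlite3 module, and the `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada', 'London'), (2, 'Grace', 'New York');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 1, 15.0);
""")


def names_with_loop():
    """One extra query per order: easy to write, but N+1 round trips."""
    result = []
    orders = conn.execute(
        "SELECT id, customer_id, total FROM orders ORDER BY id"
    ).fetchall()
    for order_id, customer_id, total in orders:
        (name,) = conn.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        result.append((order_id, name, total))
    return result


def names_with_join():
    """A single query: the join replaces the loop and lists only needed columns."""
    return conn.execute(
        """
        SELECT o.id, c.name, o.total
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        ORDER BY o.id
        """
    ).fetchall()


print(names_with_loop() == names_with_join())  # True
```

Both functions return the same rows; the join version simply lets the database do the matching in one pass rather than issuing a separate lookup for every order.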


Adv DB: Transaction management and concurrency control

Transaction management, in a nutshell, is keeping track of the changes made to a database, whether they are serialized or scheduled.  An overly simplistic example is debiting $100 from, and crediting $110 to, the same account.  If the account balance is currently at $90, the order of these transactions is vital to avoid overdraft fees.  Concurrency control, in turn, is used to ensure data integrity while transactions occur, which makes the two topics interconnected.  In our example, serializing the transactions (all actions are done consecutively) is key.  You want to add the $110 first, so that you have $200 in the account, and only then debit the $100.  To do this you will need timestamp ordering/serialization.  This became a terrible issue back in 2010 and was still an issue in 2014: Kristof (2014) reports on a survey of 44 major banks in which half still re-ordered transactions, which can drain account balances and cause overdraft fees.  The way banks usually get around all of this is by having processing times for deposits that are typically longer than the processing times for charges.  Thus, even if the transactions are done correctly in serial order, the processing time per transaction can vary so significantly that these issues still happen.  According to Kristof (2014), banks say they do this to process payments in order of priority.
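The arithmetic of the example above can be worked through in a few lines. This is a sketch under stated assumptions: the `Txn` record, its integer timestamps, and the "charges first" ordering are hypothetical stand-ins used only to contrast timestamp ordering with a re-ordered schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    timestamp: int  # when the customer initiated it (hypothetical units)
    amount: int     # positive = deposit, negative = charge


def apply_all(balance, txns):
    """Apply transactions in the given order and count dips below zero."""
    overdrafts = 0
    for txn in txns:
        balance += txn.amount
        if balance < 0:
            overdrafts += 1
    return balance, overdrafts


txns = [Txn(timestamp=1, amount=+110), Txn(timestamp=2, amount=-100)]

# Timestamp ordering: the deposit arrives first, so 90 -> 200 -> 100.
by_timestamp = sorted(txns, key=lambda t: t.timestamp)

# Re-ordered so the charge is processed first: 90 -> -10 -> 100.
charges_first = sorted(txns, key=lambda t: t.amount)

print(apply_all(90, by_timestamp))   # (100, 0)
print(apply_all(90, charges_first))  # (100, 1)
```

The final balance is the same either way; only the schedule that processes the charge first passes through a negative balance, which is where an overdraft fee would be triggered.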

The case above illustrates why an optimistic concurrency control method is not helpful here.  Optimistic methods do not check for serializability while the transactions are initially being carried out.  Instead, transactions are done locally and validated against serializability before they are finalized, and a transaction that fails validation has to be redone (causing a high cost on resources).  Here, if we started at the first of the month and paid a bunch of bills, then realized we were close to $0, deposited $110, and continued paying bills to the sum of $100, validating and redoing that work can eat up a lot of processing time.  Thus it can get quite complicated quite quickly.  Conservative concurrency controls have the fewest aborts and eliminate wasted processing by doing things serially, but you cannot run things in parallel.
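The read, validate, and write phases of an optimistic scheme can be sketched with a simple version counter. This is a single-threaded illustration under stated assumptions, not a production design: the `VersionedAccount` class and `run_transaction` helper are hypothetical, and a real system would have to make the validation and write steps atomic.

```python
class VersionedAccount:
    """A single record with a version counter used for validation."""

    def __init__(self, balance):
        self.balance = balance
        self.version = 0

    def read(self):
        # Read phase: take a private snapshot; no locks are held.
        return self.balance, self.version

    def try_commit(self, new_balance, version_read):
        # Validation phase: fail if anyone committed since our read.
        if self.version != version_read:
            return False
        # Write phase: publish the change and bump the version.
        self.balance = new_balance
        self.version += 1
        return True


def run_transaction(account, delta, max_retries=5):
    """Apply delta optimistically, redoing the work when validation fails."""
    for attempt in range(1, max_retries + 1):
        balance, version = account.read()
        if account.try_commit(balance + delta, version):
            return attempt
    raise RuntimeError("too many conflicts")


account = VersionedAccount(90)

# Two transactions read the same snapshot before either one commits.
balance_a, version_a = account.read()
balance_b, version_b = account.read()

first = account.try_commit(balance_a + 110, version_a)   # commits
second = account.try_commit(balance_b - 100, version_b)  # stale, rejected

# The rejected charge is redone against the current state.
attempts = run_transaction(account, -100)
print(first, second, attempts, account.balance)  # True False 1 100
```

The rejected commit is the wasted work the text refers to: the charge was computed once against a stale balance and had to be computed again.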

Huge amounts of incoming data, like those from the Internet of Things (where databases need to be flexible and extensive because a projected trillion different items would be producing data), would benefit greatly from optimistic concurrency control.  Take the example of a Fitbit, Apple Watch, or Microsoft Band.  It records data on you throughout the day.  Because this massive data is time-stamped and heterogeneous, it doesn't matter if the data for sleep and walking are processed in parallel; in the end, it is still validated.  This allows for a faster upload time over Bluetooth and/or Wi-Fi.  Data can be actively extracted and explained in real time, but when there are many sensors on the device, the data and sensors all have different forms of reasoning rules and semantic links between data, where existing or deductive links between sources exist (Sun & Jara, 2014), and that is where the true meaning of the generated data lies.  Sun and Jara (2014) suggest that a solid mathematical basis will help ensure a correct and efficient data storage system and model.