What makes big data different from conventional data that you use every day?
The difference between big data and conventional data lies in how the data are stored and analyzed. Big data is complex, challenging, and significant (Ward & Barker, 2013). Ward and Barker (2013) traced the Volume, Velocity, and Variety definition back to Gartner. They then compare it with Oracle’s definition, which takes big data to mean the value derived from merging relational databases with unstructured data that can vary in size, structure, format, etc. Finally, the authors state that Intel’s definition of big data centers on a company generating about 300 TB of data weekly, typically from transactions, documents, emails, sensor data, social media, etc. They use all of this information to argue that the true definition should rest on the size of the data, the complexity of the data, and the technologies used to analyze the data. This is how you can differentiate it from conventional data.
Davenport, Barth, and Bean (2012) stated that IT companies define big data as “more insightful data analysis”, and that companies that use it properly can gain a competitive edge. Companies that use big data: are aware of data flows (customer-facing data, continuous process data, and network relationships, all of which are dynamic and always changing in a continuous flow), rely on data scientists (who combine upgraded data management skills, programming, math, stats, business acumen, and effective communication), and move analytics away from IT functions (concerned with automation) into ops or prod functions (since the goal is to present information to the business first). Data in a continuous flow needs business processes set up for obtaining/gathering/capturing, storing, extracting, filtering, manipulating, structuring, monitoring, analyzing, and interpreting it, to help facilitate data-driven decisions.
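To make the idea of processing data in a continuous flow a little more concrete, here is a minimal Python sketch. It is not taken from Davenport, Barth, and Bean (2012); the stage names (`capture`, `keep_valid`, `running_mean`) and the `amount` field are hypothetical, chosen only to show capturing, filtering, and analyzing records one at a time instead of loading a whole data set first.

```python
from typing import Dict, Iterable, Iterator


def capture(raw_events: Iterable[Dict]) -> Iterator[Dict]:
    """Capture stage: hand events downstream one at a time (hypothetical source)."""
    for event in raw_events:
        yield event


def keep_valid(events: Iterable[Dict]) -> Iterator[Dict]:
    """Filter stage: drop events whose 'amount' field is missing or non-numeric."""
    for event in events:
        if isinstance(event.get("amount"), (int, float)):
            yield event


def running_mean(events: Iterable[Dict]) -> Iterator[float]:
    """Analysis stage: emit an updated mean after every valid event."""
    total = 0.0
    count = 0
    for event in events:
        total += event["amount"]
        count += 1
        yield total / count


if __name__ == "__main__":
    # Made-up sample events, standing in for a live feed.
    sample = [{"amount": 10}, {"amount": None}, {"amount": 20}]
    for mean in running_mean(keep_valid(capture(sample))):
        print(mean)
```

Because each stage is a generator, no stage waits for the full data set, which is the property that matters when the data never stop arriving.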
Finally, Lazer, Kennedy, King, and Vespignani (2014) discussed big data hubris: the assumption that big data can do it all and is a great substitute for conventional data analysis. They state that errors in measurement, validity, reliability, and dependencies in the data cannot be ignored. Big data analysis can overfit to a small number of cases. A big dataset gains more value when it is married with other near-real-time data from different sources, but continuous evaluation and improvement should always be incorporated. Sources of error in analysis can arise from measurement (is it stable and comparable across cases and over time, and are there systematic errors?), algorithm dynamics, search algorithms, and changes in the data-generating process. The authors finally state that transparency and replicability of data analysis (especially for secondary or aggregate data, since these raise fewer privacy concerns) could help improve the results of big data analysis. Without transparency and replicability, how will other scientists learn and build on the knowledge? Its absence undermines the accumulation of knowledge.
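The overfitting risk can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis from Lazer et al. (2014): the target and every candidate predictor are pure random noise, and the function name `best_spurious_fit` and its parameters are hypothetical. It shows that searching through more candidate predictors, with only a few cases to check them against, can only raise the best correlation found by chance.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def best_spurious_fit(n_cases, n_candidates, seed=0):
    """Return the largest |correlation| between a random target and
    n_candidates random predictors, none of which is truly related to it."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_cases)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_cases)]
        best = max(best, abs(pearson(candidate, target)))
    return best


if __name__ == "__main__":
    # Same 20 cases, but a far larger pool of candidate predictors to search.
    print("best of 5 candidates:   ", best_spurious_fit(20, 5))
    print("best of 2000 candidates:", best_spurious_fit(20, 2000))
```

With the same seed, the larger search includes every candidate from the smaller one, so its best match is never lower, even though nothing here has any real relationship to the target. That is why a strong fit found by scanning many variables needs to be validated on fresh data.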
There is a difference between big data and conventional data. But no matter how big, fast, and different the data sets are, one cannot deny that big data has influenced conventional data gathering, analysis, and techniques. Improvements have been made that allow doctoral students to conduct surveys at a much faster rate and to gather more unstructured data through interviews, and the transcription software used for audio files in big data can also be used on smaller, conventional data. Though the two are vastly different and each comes with its own errors, as we improve one, we inadvertently improve the other.
Public sites that provide free access to big data sets:
References:
- Davenport, T. H., Barth, P., & Bean, R. (2012). How big data is different. MIT Sloan Management Review, 54(1), 43.
- Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu: Traps in big data analysis. Science, 343(6176), 1203-1205.
- Ward, J. S., & Barker, A. (2013). Undefined by data: a survey of big data definitions. arXiv preprint arXiv:1309.5821.