What is Big Data?
By Ganapathi Devappa
We have been hearing a lot about Big Data. When I tell people that we provide Big Data solutions, many of them ask me: what exactly is Big Data? Though everyone has some notion of what it is, they are not really sure. Here I try to explain what is really meant by Big Data, though there is no single clear definition and there are plenty of blogs and books on the subject.
Big Data Origins
The universe has been accumulating data since its origin (if there was such a point as an origin). Even before life appeared on earth, the earth was accumulating data in its many layers of rock and soil. After life appeared, further data was created through fossils, and much data was destroyed through consumption. Human beings have been accumulating data in written form for thousands of years, as can be seen in libraries and museums around the world.
Digital data has been around since computers came into existence. In this blog, and in big data terminology in general, I will refer to digital data simply as data.
Initially, the creation of digital data was limited by the time it took to create the data and by the number of people producing it. In the last few years, the time it takes to create data has dropped drastically. There are now instruments, called sensors, that can produce data unattended by any human being, at rates of many readings per second. So the number of data creators has increased manyfold. People are also producing data for more hours in a day: even when they are not in the office, when they are commuting, and on weekends.
This has led to the proliferation of data that is now being referred to as Big Data. Big data is generally said to have three main characteristics.
Variety of data
Initially there was only file data. Then came spreadsheets, which are still in use today. The advent of relational databases and client-server systems created a revolution in data usage and in the growth of transactional data. There is probably not an organization in the world that does not use a relational database to store its transaction data. The growth of transaction data led to data warehouses and Storage Area Networks to store it. This was all structured data. With the widespread use of the internet, a lot of unstructured data started being produced, leading to a proliferation in the variety of data: there are now images, audio, video, blog entries, comments, technical solutions, commentaries, log files and so on.
Volume of data
As the variety of data increased, so did the volume. As mentioned earlier, the barriers to creating digital data fell, and new avenues of creating data, such as mobiles, tablets and sensors, have proliferated. This has led to a massive increase in the volume of data. Data on the internet was about 8 exabytes in 2008 and 150 exabytes in 2011. It reached a mind-boggling 670 exabytes at the end of 2013.
Velocity of data
Velocity refers to the rate of growth of data. The internet is growing at more than 30% per year. As the number of internet users increases by thousands per day, data consumption and creation are also increasing. Global presence, sensor data, increased digitization and an increase in the number of data points collected are causing organizations to see massive rates of data growth.
Even data at the level of a single organization is growing very fast. For example, 100 instruments sampling once per second and collecting 1 KB of data per sample will together produce more than 3 TB of data per year. With the global presence of organizations, if there are 100 such centers, the organization will have to handle over 300 TB of data per year.
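The arithmetic in this example can be checked with a short script (the figures used, 100 instruments, 1 KB per second, 100 centers, are the hypothetical ones from the paragraph above):

```python
# Back-of-the-envelope check of the sensor example above.
# Hypothetical figures from the text: 100 instruments, each
# collecting 1 KB (1024 bytes) of data every second, all year.

instruments = 100
bytes_per_sample = 1024
seconds_per_year = 60 * 60 * 24 * 365

bytes_per_year = instruments * bytes_per_sample * seconds_per_year
terabytes = bytes_per_year / 1e12  # decimal TB; about 2.9 with binary 1024**4

print(f"One center:  {terabytes:.1f} TB/year")    # just over 3 TB
print(f"100 centers: {terabytes * 100:.0f} TB/year")
```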
Cost of Data Management
With the increase in the volume of data, the cost of managing it is also increasing. Though the cost per terabyte has come down with the drop in hardware prices, it is still becoming prohibitive for large organizations. Organizations that used to spend $40,000 to $50,000 per gigabyte of data are probably spending about $50,000 per terabyte now. New paradigms of handling big data are reducing this cost even further, to less than $5,000 per terabyte.
To conclude, Big Data spans structured data like spreadsheets, relational databases and sensor data, as well as unstructured data like blogs, program logs, images, video and audio. This data is growing at an unprecedented rate, and organizations need to gear up to use it to their advantage. They also have to think about reducing the cost of data management.
Here are some terms for the data sizes relevant to Big Data:
Gigabyte (GB)  = 1024 MB
Terabyte (TB)  = 1024 GB
Petabyte (PB)  = 1024 TB
Exabyte (EB)   = 1024 PB
Zettabyte (ZB) = 1024 EB
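As a small illustration, these units can be turned into a helper that converts a raw byte count into the largest sensible label from the list above (a sketch using the binary 1024 multiples; the function name is my own):

```python
# Convert a raw byte count into the largest sensible unit,
# using the binary multiples listed above (1 KB = 1024 bytes, etc.).

UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: float) -> str:
    for unit in UNITS:
        # Stop once the value fits in this unit, or we run out of units.
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1024

print(human_readable(670 * 1024**6))  # the 2013 internet estimate: 670.0 EB
```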
Books on the subject:
Too Big to Ignore, by Phil Simon
Big Data, Big Analytics, by Michael Minelli, Michele Chambers and Ambiga Dhiraj
About the Author:
Ganapathi is an expert in data and databases. He has been managing database projects for many years and now consults with clients on Big Data implementations. He is a Cloudera certified Hadoop administrator and a Sybase certified database administrator. He has worked with clients in the US, UK, Australia, Japan and India on many large projects. He has helped implement large database projects in Sybase, Oracle, Informix, DB2, MySQL and, recently, SAP HANA. He has been using big data technologies like Apache Hadoop and SAP HANA, and has been providing strategies for dealing with large databases and performance issues. He is based out of Bangalore, India.