WHY BIG DATA AND WHEN?

By Ganapathi Devappa


You have a mountain of data and it is growing at a very rapid pace. But when is a Big Data solution relevant for you, and why should you go for one? Here I provide a few pointers to these questions.

In the earlier blog on ‘What is Big Data’, I discussed the 3 V’s of Big Data: Variety, Volume and Velocity. Organizations whose data volumes are huge and growing rapidly, and which hold a variety of structured and unstructured data, need to look at Big Data solutions.

Unstructured Data

Organizations have been building transaction systems to maintain customer transactions. These transaction databases feed the data warehouse that houses historic data, usually through an ETL tool that cleanses the data as it is loaded. This is all structured data. With the internet, there is now also web-related data such as blogs and click stream data, along with program logs that need to be analyzed. These are all unstructured data. They can be cleansed and loaded into databases, but the cost of maintaining such databases can be very high, and as the data grows, performance becomes a bottleneck for analysis. Big Data technologies provide a way to handle unstructured data efficiently and help you keep up with the pace of data growth.

Bursting at the seams

As data grows every year, organizations fill up their data warehouses and have to move to bigger and better systems, in both computing power and storage. These systems do not scale linearly, so cost rises disproportionately while performance degrades. Even as more computing power is added, organizations are hitting a speed barrier: there are limits on how much parallelism you can add to a process, so processes hit a plateau and take too long to run.
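The limit on parallelism mentioned above can be illustrated with Amdahl's law, a standard model that this article does not name but which fits the point being made. The sketch below assumes a hypothetical job in which 90% of the work can be spread across workers; both the function name and that percentage are illustrative assumptions, not measurements.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's law for a job split across `workers`.

    parallel_fraction: share of the work (0.0 to 1.0) that can run in parallel.
    workers: number of processors or nodes working on the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part always takes the same time; only the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical job: 90% of the work can be parallelized.
    for workers in (1, 10, 100, 1000):
        print(workers, "workers ->", round(amdahl_speedup(0.9, workers), 2), "x faster")
```

Under this assumption the speedup can never reach 10x, no matter how many workers are added, because the serial 10% of the job is untouched. That is the plateau described in the text.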

Big Data technologies have come to the rescue of such organizations by providing ways to manage the overflowing data economically. They use a distributed architecture to spread the increased load. Technologies such as Hadoop, with its shared-nothing architecture, and SAP HANA, with in-memory and massively parallel processing, provide efficient solutions to data growth problems.

Economic impact

As more computational power is needed to handle large quantities of data, the cost is also escalating. Organizations may be spending as much as $50,000 per TB of data. As data sizes grow towards petabytes, the costs become prohibitive. With some open source Big Data solutions, the cost can be reduced to less than $2,000 per TB of data. This alone is prompting many organizations to adopt Big Data technologies. The incremental cost of adding computing and storage resources as the organization expands is also very low.
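To make the difference concrete, here is a small worked calculation using the two per-TB figures quoted above. It is only a sketch: it assumes a flat cost per TB at every scale and takes 1 PB as 1,000 TB, and the function names are made up for illustration.

```python
# Per-terabyte costs in US dollars, as quoted in the text (rough estimates).
TRADITIONAL_COST_PER_TB = 50_000
OPEN_SOURCE_COST_PER_TB = 2_000


def total_cost(terabytes, cost_per_tb):
    """Total cost of holding `terabytes` of data at a flat per-TB rate (assumption)."""
    return terabytes * cost_per_tb


def compare_costs(terabytes):
    """Compare the two approaches for a given data size in TB."""
    traditional = total_cost(terabytes, TRADITIONAL_COST_PER_TB)
    open_source = total_cost(terabytes, OPEN_SOURCE_COST_PER_TB)
    return {
        "traditional": traditional,
        "open_source": open_source,
        "savings": traditional - open_source,
        "ratio": traditional / open_source,
    }


if __name__ == "__main__":
    for tb in (10, 100, 1000):  # 1000 TB is taken as 1 PB
        result = compare_costs(tb)
        print(f"{tb} TB: ${result['traditional']:,} vs ${result['open_source']:,}")
```

At these rates a petabyte costs about $50 million on the traditional route against about $2 million on the open source route, a 25-fold difference.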

Competition

In every field, competition has increased drastically. With globalization, you have to compete not only with new local players but also with global players diversifying and entering your territory. So attracting and retaining customers has become an even higher priority. Studying customer behavior is important because there are many channels through which a customer may access your services. Identifying and retaining loyal customers and preventing customer churn are very important for any business.

Better decisions

It is not enough just to maintain transactions and customer data. Data also provides a lot of insight into customer behavior. What is the use of data if you can’t make decisions based on it? Intuition normally comes with experience, and the data you have is basically the experience of your organization. So if this data can be used to make decisions, those decisions will be better than the intuition of a few individuals. Studies have found that businesses that use their big data to make decisions are 20% better off than their counterparts in growing their business.

Subset of data vs. whole of data

Traditionally, organizations picked the subset of data that they thought was most relevant and used it to make decisions. But what if you could use the whole of your data? Time used to be a major constraint here, as it would have taken days to analyze terabytes of data and come to some conclusions. With the massive parallelism of big data solutions and the reduced cost of processing a terabyte of data, it has become possible to use all of your data for analysis. Real-time use of data is important in situations like credit card fraud detection and online customer analysis. Near-real-time use of data is required in other situations, like inventory replenishment and power load prediction.

Predictive vs. analytic

Traditionally, BI tools were used to query a company’s data warehouses and draw conclusions about customers, such as which product was in demand or which product was out of stock. With more data available, companies are looking at using the data to build predictive models that tell managers what they should do next. The accuracy of a prediction depends largely on the amount of data you have, and Big Data technologies help in making those predictions. For example, you can integrate R (a language used for analytics and predictions) with Hadoop or SAP HANA so that you have both parallel processing capability and a way to use the massive data you have.

To conclude, Big Data technologies have brought a paradigm shift in the way data is processed, and organizations need to move away from traditional data warehouse technologies to handle the growth and take advantage of the data that already exists in the organization.

About the Author:

Ganapathi is an expert in data and databases. He has been managing database projects for many years and now consults with clients on Big Data implementations. He is a Cloudera certified Hadoop administrator and also a Sybase certified database administrator. He has worked with clients in the US, UK, Australia, Japan and India on many large projects. He has helped implement large database projects in Sybase, Oracle, Informix, DB2, MySQL and, recently, SAP HANA. He has been using big data technologies like Apache Hadoop and SAP HANA and has been providing strategies for dealing with large databases and performance issues. He is based out of Bangalore, India.
