How to Get Started with Big Data


By Ganapathi Devappa

So you have realized that you have a lot of data and want to put it to work for your business. The question now is: how can Big Data techniques help you take advantage of it?

Many organizations jump into Big Data by simply starting a project, setting up a team and buying some machines, so that they can get a head start on the competition. But a year into it, they wonder what they are getting in return. In this blog, I try to explain a better way of going about it without losing valuable time.

1. Define the use for the data.

It is not enough to define what you want to find from the data; you must also define what you are going to do with the findings. Suppose you want to analyze the data and find out your customer churn rate. Define what you are going to do once you have this information. You can't say, 'once we find out what it is, we will see what to do with it.'

A client in the logistics industry told me that they wanted to figure out their monthly vehicle utilization over the last year. I asked them: suppose we find that the utilization is X%, what will you do based on this finding? Will they want to increase utilization by 5% or 10%? What steps will they take to do that? Do they have a budget for it? Otherwise this will be just another fault-finding exercise with no long-term benefit for the company.

2. Estimate the return

Once you decide what you want to do with the findings, do a rough estimate of the return. For example, if you increase customer retention by 1%, what is the expected increase in revenue? Based on this, you can say what kind of budget you can justify for your Big Data project.

For the logistics company mentioned above, we specifically wanted to know what the savings would be if utilization increased by 5%. The number was big enough to make the CxOs sit up and say they wanted to fund the project.
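A back-of-the-envelope version of this estimate takes only a few lines. The sketch below is a hypothetical illustration — the fleet size, running cost and utilization gain are placeholder figures, not numbers from the client project described above:

```python
# Back-of-the-envelope return estimate for a Big Data project.
# All figures are hypothetical placeholders -- substitute your own.

def estimated_annual_savings(fleet_size, cost_per_vehicle_per_year, utilization_gain):
    """Rough savings if better utilization lets the same work be done
    with proportionally fewer vehicle-hours."""
    return fleet_size * cost_per_vehicle_per_year * utilization_gain

# Example: 200 vehicles, $50,000 running cost each per year,
# utilization improved by 5 percentage points.
savings = estimated_annual_savings(200, 50_000, 0.05)
print(f"Estimated annual savings: ${savings:,.0f}")
```

Even a crude estimate like this gives you a ceiling for the project budget and a number to put in front of the decision makers.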

3. Who is responsible?

Fix responsibility and accountability early on. Whoever is driving your project should be given the responsibility and held accountable for the results. A year back, people were probably not sure what to expect from Big Data analysis, as the technology was new. But things have changed: the technology is now quite stable, so you can define your expectations and run this as a well-defined project. I always say that 'there is no responsibility without authority', so make sure you give proper authority to whoever is responsible.

4. Baby steps

Don't start off with a big-bang project that will take a year to execute and consume a lot of your resources. Start small, with a limited goal and a 3-4 month timeline, and move on from there. A Big Data project like consolidating data from all departments and producing analytics across all of it will probably never take off. It is better to define a smaller goal, like getting data from two key departments and extracting meaningful analysis from them.

5. What Technology?

Though there is a proliferation of Big Data technology, two technologies have evolved well and are keeping up with Big Data expectations. The two we recommend are SAP HANA and Apache Hadoop.

SAP HANA is the super-fast in-memory columnar database from SAP and is one of their fastest-selling products. Since a lot of your data is already in relational databases, it is much easier to load that data into SAP HANA. SAP HANA also provides mechanisms to load web logs and Twitter data, and has special handling for text data. It provides real-time analysis, returning responses within seconds even for queries that span millions of rows. It takes advantage of the larger memory available in newer hardware as well as multi-core and multi-CPU architectures.

Apache Hadoop has become synonymous with Big Data these days and provides a lot of tools for ingesting, storing and processing Big Data. Apart from MapReduce and HDFS, which form the core of Hadoop, it offers an ecosystem of products: Sqoop and Flume to ingest data from a variety of sources, and Hive and HBase to provide a relational view of the data. There are also low-latency, real-time query engines like Impala that sit on the Hadoop framework.
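To give a concrete taste of the MapReduce model at Hadoop's core, here is a minimal word-count sketch in Python, written in the style used with Hadoop Streaming (where the map and reduce steps read and write plain lines). The `run_local` driver is only a stand-in that simulates Hadoop's map, shuffle/sort and reduce phases on one machine:

```python
# Minimal word count in the MapReduce style used with Hadoop Streaming.
# run_local simulates Hadoop's map -> shuffle/sort -> reduce pipeline locally.
from itertools import groupby
from operator import itemgetter

def map_line(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reduce_group(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)

def run_local(lines):
    pairs = [kv for line in lines for kv in map_line(line)]
    pairs.sort(key=itemgetter(0))              # the "shuffle/sort" phase
    return dict(reduce_group(word, (c for _, c in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

print(run_local(["big data big plans", "big results"]))
# {'big': 3, 'data': 1, 'plans': 1, 'results': 1}
```

On a real cluster, the same map and reduce logic runs in parallel across many machines, with HDFS holding the input and Hadoop handling the shuffle — which is what lets this simple model scale to very large data.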

Three main vendors, Cloudera, Hortonworks and MapR, have taken a lead on Apache Hadoop and provide full support for the platform. We at Spider OpsNet also provide support for Hadoop and can help you design a road map and get you started.

Also find out what your team is most comfortable with, as they are the ones who will implement and use the technology solution. You can send them to a couple of seminars on both technologies to gauge their comfort level.

6. Leverage Cloud

You can leverage the cloud to reduce your initial investment during the experimental phase. Both Apache Hadoop and SAP HANA are available in cloud environments, so you can learn with a minimal investment of a few dollars a day. Amazon Web Services (AWS) is one such service you can use. Some cloud platforms make economic sense even for long-term use, due to lower capital investment and the ability to add hardware within minutes.
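To make the rent-versus-buy trade-off above concrete, a simple break-even sketch helps. All the prices below are hypothetical placeholders, not real AWS or hardware vendor quotes:

```python
# Break-even point: renting cloud machines by the day vs buying hardware.
# All prices are hypothetical placeholders -- plug in real quotes.

def breakeven_days(hardware_capex, cloud_cost_per_day):
    """Days of cloud rental whose total cost equals the up-front purchase."""
    return hardware_capex / cloud_cost_per_day

# Example: $60,000 of servers vs a small experimental cluster at $40/day.
days = breakeven_days(60_000, 40)
print(f"Cloud stays cheaper for about {days:.0f} days (~{days / 365:.1f} years)")
```

For a 3-4 month experiment, numbers like these usually come out heavily in favor of the cloud; the calculation is worth redoing with real quotes before committing to long-term production hardware.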

7. Keep watch on the competition

Some studies show that organizations that adopt Big Data analysis are 10% better off than their competitors. So keep an eye on other players in your industry, so that you are aware of what they are trying to do in this space.

8. To lead or to follow?

Big Data technology does need significant resources. So decide whether you want to be the leader in this space, or whether you want a bigger player to take the lead first and publish their use of Big Data so that you can adopt it after some time. For medium-sized organizations, it is better to wait a little longer for a technology to become mainstream before investing in it. That said, Hadoop and SAP HANA are quite mature technologies to follow.

9. In-house or outsource?

It is not easy to find Big Data resources in the market. It also takes a lot of time to train your internal resources on the technology, and the learning curve is quite steep. Organizations can adopt a middle-ground strategy: get a few internal resources trained while bringing in outside help in parallel, so that the project starts quickly and there is still a way to manage it after it is in place. We at Spider OpsNet take up Big Data projects as well as providing Big Data resources to our customers.

References

For SAP HANA: http://www.saphana.com/community/learn

For Apache Hadoop: http://hadoop.apache.org

For Impala: http://impala.apache.org

About the Author:


Ganapathi is an expert in data and databases. He has been managing database projects for many years and now consults with clients on Big Data implementations. He is a Cloudera certified Hadoop administrator and a Sybase certified database administrator. He has worked with clients in the US, UK, Australia, Japan and India on many large projects. He has helped implement large database projects in Sybase, Oracle, Informix, DB2, MySQL and, recently, SAP HANA. He has been using Big Data technologies like Apache Hadoop and SAP HANA, and provides strategies for dealing with large databases and performance issues. He is based out of Bangalore, India.
