By Ganapathi Devappa
SAP HANA has been in the news recently as the lightning-fast in-memory database from SAP. As I was getting into Big Data, this database captured my interest. It brought back memories of my younger years as a Sybase DBA, tuning the performance of Sybase databases.
I tested the SAP HANA database and found it extremely fast and responsive. Even queries that aggregated and joined multimillion-row tables, where a table scan could not be avoided, returned within seconds. Sub-second response was the norm rather than the product of careful design.
Here are some of the characteristics of SAP HANA:
- In-memory database: most table data is maintained in memory
- Columnar storage, which:
  - reduces storage space
  - reduces storage access time
  - lends itself to parallel processing
- Built-in parallel processing, to take advantage of multi-core machines
- Built-in data compression, to hold more data in memory
- An architecture that reduces disk I/O without compromising data persistence
Today multi-core CPUs are common and memory has become cheaper, so machines with 200 GB or 500 GB of RAM can be had for a few thousand dollars. I remember the days when we had to fight with management to add a 100 MB SSD to the database server.
SAP HANA is geared towards running real-time queries on gigabyte- to petabyte-scale databases and returning results in seconds. Traditionally, you would have a transactional (OLTP) database for transaction processing and a separate data warehouse for offline reports and analytics. With SAP HANA, one database can cater to both needs.
It is not just data reads; data writes are also very fast. Millions of rows can be inserted in seconds thanks to the architecture of SAP HANA.
In Memory Advantage
You might have seen advertisements for machines boasting so many cores and so much L1/L2 cache and wondered whether it made any difference. The table below shows the access latency (time to get the first byte) and persistence for various types of storage:
| Type of Storage | Typical Size | Access Latency | Persistent |
| --- | --- | --- | --- |
| L1 cache | 64 KB | ~4 cycles (2 ns) | No |
| L2 cache | 256 KB | ~10 cycles (5 ns) | No |
| L3 cache (shared) | 8 MB | ~40 cycles (20 ns) | No |
| Main memory | GBs up to terabytes | 100-400 cycles | No |
| Solid state disk | GBs up to terabytes | ~5,000 cycles | Yes |
| Hard disk | Up to petabytes | ~1,000,000 cycles | Yes |
So cache is much faster than memory, memory is much faster than disk, and disk I/Os are the slowest of all (though still faster than the old floppy disks!). Reading from main memory is roughly three to four orders of magnitude faster than a hard disk access, which shows how quickly you can get to data if it is already in cache or memory.
The SAP HANA database is wired for in-memory access. Important tables are read into memory at server startup and kept there while the server runs; data is flushed to disk only occasionally. Data changes are written to an in-memory log that is persisted to disk asynchronously. When data is read and a change to it is sitting in the log, the changed data is fetched from the in-memory copy of the log. This ensures data is fetched quickly even when it is changing frequently.
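The read-through behaviour described above can be sketched as follows. This is a toy illustration of the idea only; the class and method names are mine, not HANA's actual log and delta-merge implementation:

```python
# Toy sketch of a store whose reads consult an in-memory change log
# before the main copy, so recent writes are visible immediately,
# without waiting for the asynchronous flush to disk.

class InMemoryStore:
    def __init__(self, rows):
        self.main = dict(rows)   # main in-memory copy of the table
        self.log = {}            # in-memory log of unflushed changes

    def write(self, key, value):
        # Changes go to the log; a background task would later
        # persist the log to disk and merge it into the main copy.
        self.log[key] = value

    def read(self, key):
        # Changed data is served from the in-memory log copy first.
        if key in self.log:
            return self.log[key]
        return self.main[key]

    def flush(self):
        # Stand-in for the asynchronous disk write plus merge.
        self.main.update(self.log)
        self.log.clear()

store = InMemoryStore({"k1": 100})
store.write("k1", 200)
print(store.read("k1"))  # 200: served from the log before any flush
```

The point of the split is that writers never block readers on disk I/O: the write is acknowledged once it is in the in-memory log, and persistence happens in the background.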
Tables that hold large volumes of data can be defined as columnar. In that case the values of each column are stored separately, and only a pointer to the column data is stored in the table. This drastically reduces the storage space required for the in-memory copy of the data and speeds up access to data by column value. Searches on multiple columns can also run in parallel to speed up results further, and the layout makes better use of the CPU cache.
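A minimal sketch of the difference between the two layouts (illustrative only; HANA's internal column format is far more sophisticated):

```python
# Row store: each record is kept together, so scanning one column
# drags every other column through memory as well.
row_store = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.0},
    {"id": 3, "region": "EU", "amount": 5.0},
]

# Column store: each column's values are stored contiguously,
# so a scan over one column never touches the others.
col_store = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [10.0, 25.0, 5.0],
}

# Aggregating by one column only reads the arrays involved --
# better cache locality and less data scanned than the row layout.
eu_total = sum(
    amt for region, amt in zip(col_store["region"], col_store["amount"])
    if region == "EU"
)
print(eu_total)  # 15.0
```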
The data is also compressed when stored, so its size in memory or on disk is much smaller than the raw size. For text-based data, the reduction can be as much as 90%. This allows HANA to hold more actual data in memory for in-memory processing and also reduces the amount of disk space needed.
SQL scripts are broken into directed acyclic graphs (DAGs) for processing. This makes it easier to identify parallelism, and independent nodes of the DAG can run on different cores. Aggregations (sum, count, average, etc.) are broken into smaller partial aggregations that run in parallel on different cores. Internally, SAP HANA uses the L language for processing.
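The split-and-merge pattern behind those parallel aggregations can be sketched as below. This is a generic illustration of the decomposition, not HANA's planner; note also that Python threads do not truly run CPU-bound sums on separate cores (the GIL), whereas a database engine's workers do:

```python
# Sketch of an aggregation split into partial aggregations that are
# merged at the end -- the shape of work a multi-core engine schedules.

from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    # Split the input into roughly equal chunks, one per worker.
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Each worker computes a partial sum over its own chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    # Merge step: combine the partial results.
    return sum(partials)

data = list(range(100_000))
print(parallel_sum(data))  # 4999950000, same as sum(data)
```

Sum, count, and min/max decompose this way directly; an average is handled by carrying (partial sum, partial count) pairs and dividing only at the merge step.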
You can also create a procedure or function that runs a series of SQL statements. These procedures are precompiled and their query plans stored for faster execution at run time. Like SQL scripts, they are represented as directed acyclic graphs so that parallelism can be detected and exploited on multi-core CPUs.
Building Analytical Models
SAP HANA provides an Eclipse-based IDE for developing applications on the SAP HANA database. It is easy to learn for developers familiar with Eclipse. You can use it to develop analytical models and views over the database and to create a repository of models that your development team can collaborate on.
The IDE can also be used by business users to build data models from the database without help from developers.
Instead of investing in high-end hardware, you can use the SAP HANA cloud platform to reduce capital costs; there, both hardware and software are pay-as-you-go.
JavaScript capability
SAP HANA provides a platform for server-side JavaScript, along with a UI library, SAPUI5, for building applications on it. Since these scripts run alongside the database server, they can access data quickly without it having to travel from the database to a client. The JavaScript server is built into the database server. SAP has also contributed SAPUI5 to the open source community as OpenUI5.
Predictive analytics library
SAP HANA also provides a library of server-side procedures for analytical processing, such as k-means clustering, and for predictive algorithms, such as triple exponential smoothing. These are written in C++ and interface directly with the database, returning results within seconds even for large datasets. We were able to get a power-load forecast for the next 48 hours, based on per-minute data for the previous year, within 5 seconds.
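To illustrate what triple exponential smoothing does with seasonal load data like this, here is a plain additive Holt-Winters sketch. It is my own minimal implementation for illustration, not SAP's C++ library, and the parameter defaults are arbitrary:

```python
# Additive triple exponential smoothing (Holt-Winters): tracks a
# level, a trend, and m seasonal components, then projects forward.

def holt_winters(series, m, alpha=0.3, beta=0.05, gamma=0.3, horizon=4):
    # Initialise level, trend and seasonals from the first two seasons.
    season1, season2 = series[:m], series[m:2 * m]
    level = sum(season1) / m
    trend = (sum(season2) - sum(season1)) / (m * m)
    seasonal = [x - level for x in season1]

    # One smoothing update per observation after the first season.
    for t in range(m, len(series)):
        prev_level = level
        level = (alpha * (series[t] - seasonal[t % m])
                 + (1 - alpha) * (level + trend))
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = (gamma * (series[t] - level)
                           + (1 - gamma) * seasonal[t % m])

    # Forecast: extrapolate the trend and reuse the seasonal pattern.
    n = len(series)
    return [level + h * trend + seasonal[(n - 1 + h) % m]
            for h in range(1, horizon + 1)]

# Synthetic load with an upward trend and a period-4 seasonal pattern.
series = [20 + 0.5 * t + [4, 0, -3, -1][t % 4] for t in range(24)]
print(holt_winters(series, m=4, horizon=4))
```

For per-minute power data a seasonal period of one day (m = 1440) would be the natural choice; the library procedure does the equivalent fitting in-database rather than pulling a year of rows out to the client.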
In conclusion, SAP HANA is a racehorse you can bank on for the heaviest of your data loads and processing. Any organization with large amounts of data should consider SAP HANA for its data processing needs.
Ganapathi is an expert in data and databases. He has been managing database projects for many years and now consults with clients on Big Data implementations. He is a Cloudera certified Hadoop administrator and a Sybase certified database administrator. He has worked with clients in the US, UK, Australia, Japan and India on many large projects. He has helped implement large database projects in Sybase, Oracle, Informix, DB2, MySQL and, recently, SAP HANA. He has been using big data technologies such as Apache Hadoop and SAP HANA and provides strategies for dealing with large databases and performance issues. He is based in Bangalore, India.