Tuesday 27 August 2013

The Real Reason for HANA

With great interest I follow your blogs. Since the introduction of openHPI (the online courses of the HPI in Potsdam), I have become used to discussing the architecture, the benefits, and the consequences of an in-memory DB for both OLTP and OLAP. I can understand that the transition from classic row-store DBs to column-store DBs for OLTP workloads is somewhat difficult. In all these years of building transactional systems, we put all the focus on data entry. We check the incoming data for referential integrity (lots of single selects), maintain indices (a primary one and several secondary ones), and update several levels of aggregates for faster reporting.
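To make that per-posting overhead concrete, here is a minimal sketch of such a classical write path. The tables, accounts, and names are invented for illustration only, not SAP code.

```python
# Hypothetical sketch of the classical OLTP write path described above.
# Table layout, accounts, and names are invented for illustration, not SAP code.

line_items = []                      # the transactional table
accounts = {"4711", "4712"}          # reference data for the integrity check
index_by_account = {}                # one secondary index
aggregates = {}                      # materialized totals per (account, period)

def post_line_item(doc_no, account, period, amount):
    # 1. Referential integrity: a single select per foreign key
    if account not in accounts:
        raise ValueError(f"unknown account {account}")

    # 2. Insert into the base table
    row_id = len(line_items)
    line_items.append((doc_no, account, period, amount))

    # 3. Maintain the secondary indices
    index_by_account.setdefault(account, []).append(row_id)

    # 4. Update the materialized aggregates kept only for reporting speed
    key = (account, period)
    aggregates[key] = aggregates.get(key, 0.0) + amount

post_line_item("100000001", "4711", "2013-08", 250.0)
print(aggregates)                    # {('4711', '2013-08'): 250.0}
```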
 
With the growing workload, it became more and more difficult to keep the reporting in the OLTP system. The creation of a data warehouse for financial data came at a high price: redundant data, data transfers, and aggregation into cubes with significant latency (hours), only to achieve a decent response time without disturbing the OLTP system. Let us take the core financial system of the SAP Business Suite and look at the main tables (all smaller tables for configuration, customizing, and validity checks will be ignored). The numbers are taken from a productive SAP ERP system which was converted to HANA. All sizes are in gigabytes.
 
Accounting transaction header       9
Accounting transaction position    33
Costing journal entry              94
GL journal entry                  236
AR journal entry                    5
AP journal entry                   15
Total                             392
 
As already published in a 2009 SIGMOD paper, we can drop all the redundant data that was only introduced for speed (e.g. materialized joins, materialized aggregates) and store the remaining tables in a column store. In the column store, columns without values don't take any space. All columns with values are dictionary encoded and stored as integers. Most operations then take place on the integers (scan, compare, sort, etc.). Since there are no aggregates to maintain, the number of updates drops, and the remaining updates are executed as insert-only.
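To make the dictionary-encoding idea concrete, here is a minimal sketch (not HANA internals; the class and its methods are invented for illustration): each column keeps a dictionary of its distinct values plus one integer code per row, so a scan compares integers rather than the raw values.

```python
# Illustrative sketch of dictionary encoding in a column store (not HANA internals).

class DictColumn:
    def __init__(self):
        self.dictionary = []        # distinct values; position = integer code
        self.codes = []             # one integer per row
        self._lookup = {}           # value -> code

    def append(self, value):
        code = self._lookup.get(value)
        if code is None:
            code = len(self.dictionary)
            self.dictionary.append(value)
            self._lookup[value] = code
        self.codes.append(code)

    def scan_equal(self, value):
        # The scan compares against a single integer code, not the raw value.
        code = self._lookup.get(value)
        if code is None:
            return []
        return [row for row, c in enumerate(self.codes) if c == code]

account = DictColumn()
for v in ["4711", "4712", "4711", "4711"]:
    account.append(v)

print(account.codes)                 # [0, 1, 0, 0] -- integers, not strings
print(account.scan_equal("4711"))    # [0, 2, 3]
```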

As a result, the new financial system on HANA has a much lower storage requirement.
 
Accounting transaction header     1.6
Accounting transaction position   8.2
Costing journal entry            11.3
 
The costing journal entry will also be eliminated in the next version of the financial system. We can then expect a reduction by a factor of 39x. Since the data on permanent storage is kept in the same compressed form, we have the same savings for the backups, test systems, development systems, and the archives.
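A quick back-of-the-envelope check of that factor, based on the sizes listed above (my reading; rounding accounts for the small difference to the stated figure):

```python
# Rough arithmetic behind the stated compression factor.
old_total = 9 + 33 + 94 + 236 + 5 + 15        # 392 GB in the classic row-store system
new_with_costing = 1.6 + 8.2 + 11.3           # 21.1 GB on HANA today
new_without_costing = 1.6 + 8.2               # 9.8 GB once the costing journal is gone

print(old_total / new_with_costing)           # ~18.6x with the costing journal still present
print(old_total / new_without_costing)        # ~40x, in line with the stated ~39x factor
```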
 
In addition to these savings, we gained complete freedom in reporting, since no dimension of the recorded data is taken away by aggregation. All reporting structures are expressed as views. Views can be changed on the fly. Some of the largest audit firms have already validated the great value of this concept for period-end closing and consolidation. We can now run a P&L statement while we are still taking new entries and rerun the P&L instantly (within seconds).
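As a sketch of what "reporting as a view" means in practice (the journal layout and account names below are invented), the totals are aggregated at query time from the journal entries themselves, so new postings are visible to the very next run:

```python
# Sketch of reporting as a view computed on the fly from the journal entries,
# instead of reading materialized aggregates. Layout is invented for illustration.

journal = [
    # (account, period, amount)
    ("revenue",  "2013-08", 1000.0),
    ("revenue",  "2013-08",  400.0),
    ("expenses", "2013-08", -300.0),
]

def profit_and_loss(period):
    # A "view": nothing is stored, totals are aggregated at query time.
    totals = {}
    for account, p, amount in journal:
        if p == period:
            totals[account] = totals.get(account, 0.0) + amount
    return totals

print(profit_and_loss("2013-08"))   # {'revenue': 1400.0, 'expenses': -300.0}

# New entries are visible to the next run immediately -- no cube load, no latency.
journal.append(("revenue", "2013-08", 200.0))
print(profit_and_loss("2013-08"))   # {'revenue': 1600.0, 'expenses': -300.0}
```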

Against this background, any discussion about whether we should keep 100% of the columns that were used at least once in memory, or establish an aging algorithm, is not even of academic value.
 
The above numbers include hot and cold store together. The hot store alone will be less than 50% of that. Now you can debate whether the HANA aging algorithm should be activated for the remaining storage requirement.
 
Yes, the insert into a column store is slightly slower than into a row store, but you have to consider the time saved since no updates and no index maintenance take place. The new system is fundamentally better than the old one and can only be realized on a single in-memory store organized as a column store. Any hybrid approach (we tried many) doesn't do the trick. By the way, I don't want to get rid of the BW, just not use it for transactional reporting and analytics, which can be done directly on the transactional data. If there is any concern about too much system load, connect a second node as a slave to the same system. This doesn't add any extra permanent storage and will take all read-only load off the main node. For small systems, we will use virtualization to provide the appropriate server size.
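A minimal sketch of that read/write split, assuming a very naive statement classification (the Node and Router classes are invented for illustration, not a HANA API):

```python
# Hedged sketch of the scale-out idea: route read-only (reporting) statements to a
# replica node while all writes go to the primary. Classes are invented for illustration.

class Node:
    def __init__(self, name):
        self.name = name

    def execute(self, statement):
        return f"{self.name}: {statement}"

class Router:
    def __init__(self, primary, read_replica):
        self.primary = primary
        self.read_replica = read_replica

    def execute(self, statement):
        # Very naive classification, purely for illustration.
        if statement.lstrip().upper().startswith("SELECT"):
            return self.read_replica.execute(statement)
        return self.primary.execute(statement)

router = Router(Node("primary"), Node("replica"))
print(router.execute("SELECT sum(amount) FROM journal"))   # handled by the replica
print(router.execute("INSERT INTO journal VALUES (...)"))  # handled by the primary
```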
 
More than 100,000 students have now taken our online courses (openHPI and openSAP) and are familiar with this innovation. I hope I could help you with the discussion about in-memory DBs.
