Imagine you live in a major US city. Now, imagine that every time you want a glass of water, instead of getting it from the kitchen, you need to drive to the airport, get on a plane and fly to Germany and pick up your water there.
From the perspective of a modern CPU, accessing data which is in-memory is like getting water from the kitchen. Accessing a piece of data from the computer’s hard disk is like flying to Germany for your glass of water. In the past the prohibitive cost of main memory has made the flight to Germany necessary. The last few years, however, have seen a dramatic reduction in the cost per megabyte of main memory, finally making the glass of the water in the kitchen a cost effective and much more convenient option.
This orders-of-magnitude difference in access times has profound implications for all enterprise applications. Things that in the past were not even considered because they took so long, now become possible, allowing busi- nesses concrete insight into the workings of their company that previously were the subject of speculation and guess-work.
The in-memory revolution is not simply about putting data into memory and thus being able to work with it “faster”. We also show how the convergence of two other major trends in the IT industry:
a) the advent of multi-core CPUs and the necessity of exploiting this parallelism in software and
b) the stalling access latency for DRAM, requiring software to cleverly balance between CPU and memory activity; have to be harnessed to truly exploit the potential performance benefits.
Another key aspect of in-memory data management, is a change in the way data is stored in the underlying database. This is of particular relevance for the enterprise applications that are our focus. The power of in-memory data management is in connecting all these dots.
The revolutionary Power of an In-Memory Column-Oriented Database
Our experience has shown us that many enterprise applications work with databases in a similar way. They process large numbers of rows during their execution, but crucially, only a small number of columns in a table might be of interest in a particular query. The columnar storage model allows only the required columns to be read while the rest of the table can be ignored. This is in contrast to the more traditional row-oriented model, where all columns of a table—even those that are not necessary for the result—must be accessed.
The columnar storage model also means that the elements of a given column are stored together. This makes the common enterprise operation of aggregation much faster than in a row-oriented model where the data from a given column is stored in amongst the other data in the row.
Scaling due Parallelization Across Multiple Cores and Machines
Single CPU cores are no longer getting any faster but the number of CPU cores is still expected to double every 18 months. This makes exploiting the parallel processing capabilities of multi-core CPUs of central importance to all future software development. As we saw above, in-memory columnar storage places all the data from a given column together in memory making it easy to assign one or more cores to process a single column. This is called vertical fragmentation.
Tables can also be split into sets of rows and distributed to different processors, in a process called horizontal fragmentation. This is particularly important as data volumes continue to grow and has been used with some success to achieve parallelism in data warehousing applications. Both these methods can be applied, not only across multiple cores in a single machine, but across multiple machines in a cluster or in a data center.
Using Compression for Performance and to Save Space in Main Memory
Data compression techniques exploit redundancy within data and knowledge about the data domain. Compression applies particularly well to columnar storage in an enterprise data management scenario, since all data within a column has the same data type and
in many cases there are few distinct values, for example in a country column or a status column.
In column stores, compression is used for two reasons: to save space and to increase performance. Efficient use of space is of particular importance to in-memory data management because, even though the cost of main memory has dropped considerably, it is still relatively expensive compared to disk. Due to the compression within the columns, the density of information in relation to the space consumed is increased. As a result more relevant information can be loaded for processing at a time thereby increasing performance. Fewer load actions are necessary in comparison to row storage, where even columns of no relevance to the query are loaded without being used.
Summary: Why In-Memory now?
In-memory data management is not only a technology but a different way of thinking about software development: we must take fundamental hardware factors into account, such as access times to main memory versus disk and the potential parallelism that can be achieved with multi-core CPUs. Taking this new world of hardware into account, we must write software that explicitly makes the best possible use of it. On the positive side for developers of enterprise applications, this lays the technological foundations for a database layer tailored specifically to all these issues. On the negative side, however, the database will not take care of all the issues on its own. Developers must understand the underlying layers of soft- and hardware to best take advantage of the potential for performance gains.