Purchase your Section 508 Compliance Support guide now!

Purchase your Section 508 Compliance Support guide now!

OLAP architectures

There is no single, ideal way of storing or processing multidimensional data

You can contact Nigel Pendse, the author of this section, by e-mail on NigelP@olapreport.com if you have any comments or observations. Last updated on June 27, 2006.

Contents

Data staging
Storing active OLAP data
Processing OLAP data
The OLAP architectural matrix
Is there an ideal choice?

Introduction

Much confusion, some of it deliberate, abounds about OLAP architectures, with terms like ROLAP, HOLAP, MOLAP and DOLAP (with more than one definition) proliferating at one stage, though the last of these is used less these days. In fact, there are a number of options for where OLAP data could be stored, and where it could be processed. Most vendors only offer a subset of these, and some then go on to attempt to 'prove' that their approach is the only sensible one. This is, of course, nonsense. However, quite a few products can operate in more than one mode, and vendors of such products tend to be less strident in their architectural arguments.

There are many subtle variations, but in principle, there are only three places where the data can be stored, and three where the majority of the multidimensional calculations can be performed. This means that, in theory, there are a possible nine basic architectures, although only six make any sense.

Data staging

Most data in OLAP applications originates in other systems. However, in some applications (such as planning and budgeting), the data might be captured directly by the OLAP application. When the data comes from other applications, it is usually necessary for the active data to be stored in a separate, duplicated, form for the OLAP application. This may be referred to as a data warehouse or, more commonly today, as a data mart. For those not familiar with the reasons for this duplication, this is a summary of the main reasons:

Performance
OLAP applications are often large, but are nevertheless used for unpredictable interactive analysis. This requires that the data be accessed very rapidly, which usually dictates that it be kept in a separate, optimized structure which can be accessed without damaging the response from the operational systems.

Multiple data sources
Most OLAP applications require data sourced from multiple feeder systems, possibly including external sources and even desktop applications. The process of merging these multiple data feeds can be very complex, because the underlying systems probably use different coding systems and may also have different periodicities. For example, in a multinational company, it is rare for subsidiaries in different countries to use the same coding system for suppliers and customers, and they may well also use different ERP systems, particularly if the group has grown by acquisition.

Cleansing data
It is depressingly common for transaction systems to be full of erroneous data which needs to be 'cleansed' before it is ready to be analyzed. Apart from the small percentage of accidentally mis-coded data, there will also be examples of optional fields that have not been completed. For example, many companies would like to analyze their business in terms of their customers' vertical markets. This requires that each customer (or even each sale) be assigned an industry code; however, this takes a certain amount of effort on the part of those entering the data, for which they get little return, so they are likely, at the very least, to cut corners. There may even be deliberate distortion of the data if sales people are rewarded more for some sales than others: they will certainly respond to this direct temptation by 'adjusting' (ie distorting) the data to their own advantage if they think they can get away with it.

Adjusting data
There are many reasons why data may need adjusting before it can be used for analysis. In order that this can be done without affecting the transaction systems, the OLAP data needs to be kept separate. Examples of reasons for adjusting the data include:

    • Foreign subsidiaries may operate under different accounting conventions or have different year-ends, so the data may need modifying before it can be used.
    • The source data may be in multiple currencies that must be translated.
    • The management, operational and legal structures of a company may be different.
    • The source applications may use differenrt codes for products and customers.
    • Inter-company trading effects may need to be eliminated, perhaps to measure true added value at each stage of trading.
    • Some data may need obscuring or changing for reasons of confidentiality.
    • There may be analysis dimensions that are not part of the operational data (such as vertical markets, television advertising regions or demographic characteristics).

Timing
If the data in an OLAP application comes from multiple feeder systems, it is very likely that they are updated on different cycles. At any one time, therefore, the feeder applications may be at different stages of update. For example, the month-end updates may be complete in one system, but not in another and a third system may be updated on a weekly cycle. In order that the analysis is based on consistent data, the data needs to be staged, within a data warehouse or directly in an OLAP database.

History
The majority of OLAP applications include time as a dimension, and many useful results are obtained from time series analysis. But for this to be useful it may be necessary to hold several years' data on-line in this way — something that the operational systems feeding the OLAP application are very unlikely to do. This requires an initial effort to locate the historical data, and usually to adjust it because of changes in organizational and product structures. The resulting data is then held in the OLAP database.

Summaries
Operational data is necessarily very detailed, but most decision-making activities require a much higher level view. In the interests of efficiency, it is usually necessary to store merged, adjusted information at summary level, and this would not be feasible in a transaction processing system.

Data Updating
If the application allows users to alter or input data, it is obviously essential that the application has its own separate database that does not over-write the 'official' operational data.

Storing active OLAP data

Given the necessity to store active OLAP data in an efficient, duplicated form, there are essentially three options. Many products can use more than one of these, sometimes simultaneously. Note that 'store' in this context means holding the data in a persistent form (for at least the duration of a session, and often shared between users), not simply for the time required to process a single query.

Relational database
This is an obvious choice, particularly if the data is sourced from an RDBMS (either because a data warehouse has been implemented using an RDBMS or because the operational systems themselves hold their data in an RDBMS). In most cases, the data would be stored in a denormalized structure such as a star schema, or one of its variants, such as snowflake; a normalized database would not be appropriate for performance and other reasons. Often, summary data will be held in aggregate tables.

Multidimensional database
In this case, the active data is stored in a multidimensional database on a server. It may include data extracted and summarized from legacy systems or relational databases and from end-users. In most cases, the database is stored on disk, but some products allow RAM based multidimensional data structures for greater performance. It is usually possible (and sometimes compulsory) for aggregates and other calculated items to be pre-computed and the results stored in some form of array structure. In a few cases, the multidimensional database allows concurrent multi-user read-write access, but this is unusual; many products allow single-write/multi-read access, while the rest are limited to read-only access.

Client-based files
In this case, relatively small extracts of data are held on client machines. They may be distributed in advance, or created on demand (possibly via the Web). As with multidimensional databases on the server, active data may be held on disk or in RAM, and some products allow only read access.

These three locations have different capacities, and they are arranged in descending order. They also have different performance characteristics, with relational databases being a great deal slower than the other two options.

Processing OLAP data

Just as there are three possible locations for OLAP data, exactly the same three options are available for processing the data. As will be seen, the multidimensional calculations do not need to occur in the place where the data is stored.

SQL
This is far from being an obvious choice to perform complex multidimensional calculations, even if the live OLAP data is stored in an RDBMS. SQL does not have the ability to perform multidimensional calculations in single statements, and complex multi-pass SQL is necessary to achieve more than the most trivial multidimensional functionality. Nevertheless, this has not stopped vendors from trying. In most cases, they do a limited range of suitable calculations in SQL, with the results then being used as input by a multidimensional engine, which does most of the work, either on the client or in a mid-tier server. There may also be a RAM resident cache which can hold data used in more than one query: this improves response dramatically.

Multidimensional server engine
This is an obvious and popular place to perform multidimensional calculations in client/server OLAP applications, and it is used in many products. Performance is usually good, because the engine and the database can be optimized to work together, and the availability of plenty of memory on a server can mean that large scale array calculations can be performed very efficiently.

Client multidimensional engine
On the assumption that most users have relatively powerful PCs, many vendors aim to take advantage of this power to perform some, or most, of the multidimensional calculations. With the expected rise in popularity of thin clients, vendors with this architecture are having to move most of the client based processing to new Web application servers.

The OLAP architectural matrix

Three places to store multidimensional data, and the same three locations for multidimensional engines: combining these gives nine possible storage/processing options. But some of these are nonsensical: it would be absurd to store data in a multidimensional database, but do multidimensional processing in an RDBMS, so only the six options on or below the diagonal make sense. In the interests of completeness, some older products are included in this chart even if they are no longer on sale.


Multidimensional data storage options

Multidimensional processing options

RDBMS

Multidimensional database server

Client files

Multi-pass SQL

1

Cartesis Magnitude
MicroStrategy

 



Multidimensional server engine

2

Extensity MPC
Hyperion Essbase
Longview Khalix
Microsoft Analysis Services
Mondrian
Oracle Express (ROLAP mode)
Oracle OLAP Option (ROLAP mode)
Pilot Analysis Server

WhiteLight

4

Hyperion Essbase
Oracle Express
Oracle OLAP Option AW
Microsoft Analysis Services
PowerPlay Enterprise Server
Pilot Analysis Server
Applix TM1


Client multidimensional engine

3

Oracle Discoverer

5

Comshare FDC
Dimensional Insight
Hyperion Enterprise
Hyperion Pillar

6

Hyperion Intelligence
BusinessObjects
Cognos PowerPlay
Personal Express
TM1 Perspectives
 

 The widely used (and misused) nomenclature is not particularly helpful, but roughly speaking:

Relational OLAP (ROLAP) products are in squares 1, 2 and 3

MDB (also known as MOLAP) products are in squares 4 and 5

Desktop OLAP products are in square 6

Hybrid OLAP products are those that are in both squares 2 and 4 (shown in italics)

The fact that several products are in the same square, and therefore have similar architectures, does not mean that they are necessarily very similar products. For instance, DB2 OLAP Server and Eureka are quite different products that just happen to share certain storage and processing characteristics.

Is there an ideal choice?

Each of these options has its own strengths and weaknesses, and there is no single optimum choice. It is perfectly reasonable for sites to use products from more than one of the squares, and even more than one from a single square if they are specialized products used for different applications. As might be expected, the squares containing the most products are also the most widely used architectures, and vice versa. The choice of architecture does affect the performance, capacity, functionality and particularly the scalability of an OLAP solution, and this is discussed elsewhere in The OLAP Report.


This page is part of the free content of The OLAP Report, but which represents less than a tenth of the information available to subscribers. You can register to access a free preview of a small sample of the large volume of subscriber-only information.