![]() |
|
| Continuous OLAP | ||||||||
|
Streaming SQL and OLAP are two of the most interesting and powerful paradigms in data processing. OLAP is a well-established technique for analyzing large databases of historic data. Streaming SQL is a more recent innovation that applies the declarative power of the SQL language to the problem of managing data in motion. Why Combine OLAP with Streaming SQL?This combination can solve some business problems that can't be solved any other way. OLAP is usually hampered by conventional ETL techniques: it is difficult to keep the data warehouse up to date, because batch-based ETL processes are only efficient when dealing with a few hours or days of data. OLAP engines excel at comparisons:
Powered by a streaming SQL engine, an OLAP engine can also include the most current data in its analysis, such as this hour's data compared to the average for this hour of the day over the last 6 months. This potential immediacy enables a previously unavailable responsiveness at both the application level and customer-facing opportunities. This data isn't stored on disk - we call it data in flight as opposed to conventional data at rest - and conventional ETL has difficulty accessing it. Streaming SQL allows you to bring this data into the same format as other enterprise data, and to analyze and act on it in real time. How mondrian Can Work with SQLstreamMondrian requires its data to be stored in a relational database. To ensure high
performance on large data sets, mondrian caches query results in
memory, and also uses aggregate tables populated with
summaries of the data. For rapidly changing data, mondrian's cache and aggregate tables both require careful management .
AcquisitionSQLstream helps in acquiring the data by subscribing to sources of data-in-flight as well as by accessing the traditional data warehouse sources: databases, mainframes, and files extracted from other operational systems. Traditional ETL processes are limited to those relatively static sources, whereas SQLstream makes available data-in-flight in diverse formats: messages on message-oriented middleware, web service calls, TCP network packets, and so forth. It can monitor database tables to generate events for every new transaction, and tail a log file to read rows as they are appended to the log file. One of SQLstream's core concepts is a stream, analogous to a table in a relational database but with major additional operational advantages. While a table contains a finite set of rows inserted at some past time and stored on disk, a stream contains an infinite sequence of rows that arrive whenever the producer decides to send them. (SQLstream in fact supports tables too, so that you can combine historical or reference data with event data.) What streams and tables have in common is the fact that you can manipulate them using SQL queries. Not just the simple operations like filtering and routing, but operations that combine multiple rows, such as join and aggregation. You can combine rows from the same stream (often demarcated by a time window of interest), from other streams, and from historical and reference data. TransformationNext, you need to prepare the data and convert it into a form suitable for large-scale analysis. In SQLstream, you can use SQL to perform a real-time, continuous ETL process. For example:
A caution: using many aggregate tables when data rates are extremely high becomes limited by the I/O capacity of the DBMS, making it impossible to keep the aggregate tables 100% up-to-date. You should reduce the number or granularity of the aggregate tables, and partition each aggregate table by time to ensure that only one block per time period is being actively written to. The active block of aggregate tables can then fit into the DBMS's buffer cache. LoadingLoading the data warehouse is straightforward. SQLstream's Database Adapter makes DBMS tables appear as foreign streams, so that writing to these streams makes an insert, update or delete occur in the data warehouse. As data is loaded into the data warehouse, it becomes inconsistent with the state of mondrian's cache. Mondrian's cache is necessary for performance if mondrian has many concurrent users or if the data warehouse is so large that SQL queries take a long time. However, flushing the entire cache every time there is an update negates the value of the cache. NotificationFortunately mondrian has an API to let you notify mondrian of changes that affect its cache contents. You can tell mondrian specifically which data changed. For example, you can say "there was just a sale of beer in Texas," and mondrian will mark precisely these entries in the cache as invalid. That setting forces a re-read from the database when the next OLAP query requests those entries. Once again, the problem can easily be solved using a foreign stream. The foreign stream should call mondrian's cache control API for each row it receives. A SQLstream pump object ensures that every record written into the fact table is mirrored into the foreign stream, and therefore mondrian's cache is kept in sync with the DBMS. ConclusionIn conclusion, there is a synergy between OLAP and streaming SQL techniques that allows new business problems to be solved and existing problems to be solved much more efficiently. SQLstream provides a platform for all manner of continuous ETL operations, and mondrian with its open-source license and extensible Java architecture is a natural fit.
|
||||||||