![]() |
|
| Real-time Data Warehousing | ||||||||
|
Data warehouses are all about reporting and analysis, and OLAP is well-established as valuable for analyzing large historic databases. Typical Limitations But keeping the data warehouse up to date can be difficult, because batch-based ETL processes are only efficient when dealing with a few hours or days of data. A further limitation arises from the fact that data from disparate sources using different data models require pre-processing before even being stored in the data warehouse. These limitations typically exclude the most recent data from OLAP analyses, restricting their optimum potential for informing enterprise decisions. Breakthrough Such limitations are minimized through the use of streaming SQL, a recent innovation, which applies the declarative power of the SQL language to the problem of managing data in motion. Combining OLAP with streaming SQL can handle some otherwise insoluble business problems. While OLAP engines excel at comparisons, such as comparing this quarter's data or analyses with last year's, or one brand against another, streaming SQL adds immediacy. Immediacy of Access, Analysis, and Response Your analyses can include "right now" facts, transactions, or relationships that constitute the highest value data in the enterprise. Such data include all kinds of remote procedure calls, state changes of critical systems, business events, and interrelationships among these. For example, you can compare this hour's data and relationships with relevant aggregates or averages for this hour of the day over the last 6 months. SQLstream real-time data warehousing applications can thus provide a previously unavailable responsiveness to both application integration and the accurate handling of customer-facing opportunities. By capturing and processing streaming data --- data in flight --- rather than static, older data stored on disk, your queries ask and answer up-to-the-minute questions with real-time precision. Conventional ETL has difficulty accessing such data. Yet streaming SQL enables you to bring this data into the same format as other enterprise data, to combine it as needed, and to analyze and act on it in real time. How Mondrian Can Work with SQLstreamMondrian processes its data from a relational database, caching query results in
memory to ensure high
performance on large data sets. It also uses predefined, concurrently maintained aggregate tables populated with
summaries of the data. For rapidly changing data, mondrian's cache and aggregate tables both require careful management .
AcquisitionSQLstream helps in acquiring the data by subscribing to sources of data-in-flight as well as by accessing the traditional data warehouse sources: databases, mainframes, and files extracted from other operational systems. Traditional ETL processes are limited to those relatively static sources, whereas SQLstream makes available data-in-flight in diverse formats: messages on message-oriented middleware, web service calls, TCP network packets, and so forth. It can monitor database tables to generate events for every new transaction, and tail a log file to read rows as they are appended to the log file. One of SQLstream's core concepts is a stream, analogous to a table in a relational database but with major additional operational advantages. While a table contains a finite set of rows inserted at some past time and stored on disk, a stream contains an infinite sequence of rows that arrive whenever the producer decides to send them. (SQLstream in fact supports tables too, so that you can combine historical or reference data with event data.) What streams and tables have in common is the fact that you can manipulate them using SQL queries. Not just the simple operations like filtering and routing, but operations that combine multiple rows, such as join and aggregation. You can combine rows from the same stream (often demarcated by a time window of interest), from other streams, and from historical and reference data. TransformationNext, you need to prepare the data and convert it into a form suitable for large-scale analysis. In SQLstream, you can use SQL to perform a real-time, continuous ETL process. For example:
A caution: using many aggregate tables when data rates are extremely high becomes limited by the I/O capacity of the DBMS, making it impossible to keep the aggregate tables 100% up-to-date. You should reduce the number or granularity of the aggregate tables, and partition each aggregate table by time to ensure that only one block per time period is being actively written to. The active block of aggregate tables can then fit into the DBMS's buffer cache. LoadingLoading the data warehouse is straightforward. SQLstream's Database Adapter makes DBMS tables appear as foreign streams, so that writing to these streams makes an insert, update or delete occur in the data warehouse. As data is loaded into the data warehouse, it becomes inconsistent with the state of mondrian's cache. Mondrian's cache is necessary for performance if mondrian has many concurrent users or if the data warehouse is so large that SQL queries take a long time. However, flushing the entire cache every time there is an update negates the value of the cache. NotificationFortunately mondrian has an API to let you notify mondrian of changes that affect its cache contents. You can tell mondrian specifically which data changed. For example, you can say "there was just a sale of beer in Texas," and mondrian will mark precisely these entries in the cache as invalid. That setting forces a re-read from the database when the next OLAP query requests those entries. Once again, the problem can easily be solved using a foreign stream. The foreign stream should call mondrian's cache control API for each row it receives. A SQLstream pump object ensures that every record written into the fact table is mirrored into the foreign stream, and therefore mondrian's cache is kept in sync with the DBMS. ConclusionIn conclusion, there is a synergy between OLAP and streaming SQL techniques that allows new business problems to be solved and existing problems to be solved much more efficiently. SQLstream provides a platform for all manner of continuous ETL operations, and mondrian with its open-source license and extensible Java architecture is a natural fit.
|
||||||||