Gartner has predicted that the data processing market will grow to a value of $55B by 2016 and that much of that value is held in the data, waiting to be exploited. Maximizing the potential means analyzing data in motion, as soon as the data are created, as well as data at rest, once the data are stored. In fact, the ideal scenario is to combine the two in an integrated architecture. This is simpler to achieve that many would think, particularly with the right integration architecture in place, and the right tools for building analytics and integrations.
Data stream processing is most commonly applied to unstructured machine data streams – event data generated by servers, applications, sensors, devices and networks. And many industries are now utilizing data stream processing to provide continuous, real-time visibility in their operations from streaming analytics and to drive automated updates and actions as a result. However those industries also have significant investment and business processes based on their analytics from the stored data. So how to combine the best of both worlds?
A unified data analytics architecture would contain systems for data stream processing, machine data staging and pre-processing, and the enterprise data warehouse(s). The machine data staging and pre-processing platform is typically Hadoop / HBase, or an alternative NoSQL platform. The enterprise data warehouse would be the existing DBMS, where important requirements include ACID compliance, reusabiliy of analytics and enterprise-wide access.
Streaming Analytics for data in motion with Continuous ETL to load data at rest
The integration framework for a unified architecture is built on the connector and API capabilities of all interconnected systems. For a DBMS, this is typically the JDBC driver, and for Hadoop, this would be fast dataload into HDFS (file updates) with read through a NoSQL platform such as HBase. However, both must be supported by the stream processing platform in order to support read and write to both types of systems simultaneously.
SQLstream Blaze includes native SQL connectivity to both 3rd party database platforms and Hadoop / NoSQL platforms. The connectors include Hadoop connectivity plus a standards-based SQL/MED (Mediation of External Data) framework that utilizes JDBC connectivity and allows tables in external databases to be accessed as if they were local tables in SQLstream (using SELECT, INSERT and MERGE operations). SQLstream Blaze to data storage connectivity is used to support any or all of the following use cases:
- Continuous ETL for unstructured machine data streams. Continuous ETL enables databases and data warehouses to be maintained in real-time based on the continuous aggregation and filtering of unstructured event data streams. Continuous ETL enables accurate, timely business reporting (the data is always accurate, no need to wait for nightly aggregation runs for example).
- Static – stream table joins. The majority of real-time systems utilize existing stored data, for data augmentation and enhancement (for example, joining with customer attributes), or for greater accuracy by joining real-time data with longer term trends and predictive analytics.
- Re-streaming for time-based analysis of stored data. Stored event and time-series data can be replayed (in fast forward mode), enabling ad-hoc, time-based analysis and scenario analysis over much larger datasets.
SQL as the big data analytics language of choice
A further key requirement is the use of a single analytics language and increasingly SQL is the analytics language of choice for both structured and unstructured data. SQL is of course the primary analytics for the DBMS, but is also available through many platforms for Hadoop and NoSQL storage platforms. SQL is also a powerful analytics language for data streams as SQLstream Blaze has shown, the only difference being streaming SQL queries execute continuously over moving data streams, rather than repeatedly over stored data at rest. Given that most are familiar with SQL analytics over stored data, here’s a few examples of typical streaming SQL queries that would be executing in a unified data in motion / data at rest architecture.
As per any SQL platform, the types of analytics and analytical processes for data stream processing fall into four broad areas – alerts, analytics, predictive analytics and aggregation. The main difference being that for SQLstream Blaze, these generate real-time results with millisecond latency as measured from the time of data arrival. There are also other operations that contribute to more powerful analytics such as partitions (PARTITION BY) and joins (JOIN, UNION). Some simple examples include:
The requirement is to generate an output alert through a connector to a notification system or other external platform. For example:
SELECT STREAM ROWTIME, City, Temperature FROM WEATHERSTREAM WHERE “City” = ‘San Francisco’ AND “Temperature” > ‘100’;
This query processes an input stream of temperature readings and outputs a record containing the time, location and temperature reading for every input reading matching the selection criteria. The latency from the time the data record was emitted by the sensor to the alert being issued can be as little as a few milliseconds, even where data arrive at millions of records per second.
Aggregations (Tumbling Windows)
Streaming aggregations are used extensively in stream processing, and form the basis of continuous aggregation / ETL operations into Hadoop and DBMSs. Queries utilize functions such as AVG, COUNT, MAX, MIN, SUM, STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP, and output a record at specified intervals for each GROUP of input records. For example:
SELECT STREAM FLOOR(WEATHERSTREAM.ROWTIME to MINUTE) AS FLOOR_MINUTE, MIN(TEMP) AS MIN_TEMP, MAX(TEMP) AS MAX_TEMP, AVG(TEMP) AS AVG_TEMP FROM WEATHERSTREAM GROUP BY FLOOR(WEATHERSTREAM.ROWTIME TO MINUTE);
The result is a stream of new records, one per minute, specifying the maximum, minimum and average readings recorded in that minute. This type of query is particularly useful for generating periodic reports with zero latency from the input data streams.
Analytics (Windowed Analytics)
Windowed analytics are the basic building block for streaming analytics. Queries generate an incrementally updated output record (or row) for each new input record. Each field (or column) in the output record may be calculated using a different window or partition. Windows can be time or row-based. For example:
SELECT STREAM ROWTIME, MIN(TEMP) OVER W1 AS WMIN_TEMP, MAX(TEMP) OVER W1 AS WMAX_TEMP, AVG(TEMP) OVER W1 AS WAVG_TEMP FROM WEATHERSTREAM WINDOW W1 AS (RANGE INTERVAL '1' MINUTE PRECEDING);
In this query an output record (or row) is generated for each new input record, specifying the updates to the minimum and maximum temperatures over the preceding 60 seconds, plus an incrementally updated average for the temperature over that period.
The Advantage of a Unified Data in Motion / Data at Rest architecture
The integration of data stream processing with DBMS and Hadoop can therefore be entirely SQL standards-compliant. At the system level, the data at rest are updated in real-time, and for trend and other data to be joined with streaming data. For developers, this offers the ability to:
- Execute analytics over stored or streaming data using the same SQL queries.
- Maintain up to the second accuracy of stored data using streaming aggregation / continuous ETL.
- Build powerful predictive analytics by joining trend and predictive analytics from the DBMS with streaming predictive analytics in SQLstream Blaze.
However, for the end customer, the technical capability and benefits of an integration stored/streaming analytics platform translate into significant business benefits:
- Better strategic decision making through real-time updates of the data driving their business reporting.
- Improved operational efficiency through real-time insights into their business operations and the ability to automate actions and updates.
- Generate new revenue streams by extended existing applications for real-time access and performance.