There was a lot of interest and discussion around real-time analytics at the recent Strata in San Jose. It seems to be the hot topic, and many have been blogging about it since, for example Srinath Perera’s blog on why SQL is also a good idea for real-time streaming analytics. I would certainly agree with that sentiment on the need for SQL for streaming analytics. There are many reasons why, some technical, such as the power of the language, and some more business-oriented, such as the suitability of SQL platforms for reliable enterprise deployments, better performance and significantly lower overall cost when compared with open source and proprietary SQL-like platforms.
Why SQL for streaming analytics
SQL has traditionally been the realm of structured queries over static data. However, SQL as a language does not (and never did) imply the need for a structured data store. In fact, it is even more compelling when deployed as continuous queries over data streams. Some of the reasons include:
- SQL is a declarative and expressive language that enables sophisticated real-time analytics to be expressed using simple queries.
- SQL is pervasive: most enterprise developers know and understand SQL, and for those that don’t, it’s easy to learn.
- SQL simplifies real-time analytics by encapsulating the underlying complexity of the data management operations.
- SQL queries can be optimized automatically over distributed systems for significantly (100X) better performance than open source frameworks such as Storm. There is no need for skills-intensive hand-crafting of platform performance.
- SQL applications can be built in a fraction of the time required for low-level open source platforms and proprietary SQL-like platforms – a significant cost saving plus much faster time to implementation.
- SQL platforms can be updated on the fly without having to take them down and recompile – something that is essential for enterprise deployments and an area where CEP and open source platforms struggle.
- SQL supports user-defined operations that can be written in Java and deployed in a SQL query – this covers the small percentage of operations that cannot be readily expressed in SQL for whatever reason.
- And finally, SQL is easy to auto-generate, which enables platforms such as StreamLab for GUI-driven analysis of data streaming and visualization of streaming analytics.
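The point above about user-defined operations can be illustrated with a minimal sketch. It assumes Python and the standard-library sqlite3 module as a stand-in: the post describes operations written in Java and deployed in a streaming SQL query, whereas this example registers a Python function with SQLite and runs it over a stored table. The function name `c_to_f`, the `readings` table and the sample data are all hypothetical.

```python
import sqlite3


def celsius_to_fahrenheit(celsius):
    """Plain host-language function that will be exposed to SQL."""
    return celsius * 9.0 / 5.0 + 32.0


def query_with_udf(readings):
    """Load (sensor, temp_c) rows and filter them with a user-defined function."""
    conn = sqlite3.connect(":memory:")
    # Register the function under the hypothetical SQL name c_to_f (1 argument).
    conn.create_function("c_to_f", 1, celsius_to_fahrenheit)
    conn.execute("CREATE TABLE readings (sensor TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
    # The user-defined function is called like any built-in SQL function.
    rows = conn.execute(
        """
        SELECT sensor, c_to_f(temp_c) AS temp_f
        FROM readings
        WHERE c_to_f(temp_c) > 90
        ORDER BY sensor
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample = [("a", 20.0), ("b", 35.0), ("c", 40.0)]
    print(query_with_udf(sample))
```

The query itself stays declarative; only the one operation that is awkward to express in SQL lives in host-language code.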
SQL and SQL-like are not alike
I would argue, however, that these advantages are only true of SQL platforms, and not of Java or SQL-like platforms. The latter use SQL-style queries built using Java (or another language) constructs, which rather defeats the purpose and negates the benefits. There’s a lot more to a streaming SQL platform than having a ‘SELECT’ construct!
Standard SQL for real-time streaming analytics
Implementing the SQL standard means adherence to the same SQL that could be deployed in Oracle or DB2, but of course as continuous queries and, for the most part, for processing streams of unstructured data rather than structured stored data. SQL support means supporting the SQL data types, operators, statements and clauses. SELECT … FROM … (stream) WHERE … is a simple example, but adding clauses such as MERGE, JOIN, UNION, OVER, ORDER BY, GROUP BY and PARTITION BY brings powerful real-time correlation and query ability to data streams, particularly when combined with the WINDOW operator for processing data streams over time windows (WINDOW for sliding windows, and GROUP BY for tumbling windows).
For more information, see our introduction to stream processing with SQL at www.sqlstream.com/stream-processing.
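The difference between tumbling and sliding windows can be sketched with ordinary standard SQL. This is an illustration under stated assumptions, not the syntax of any particular streaming product: it uses Python's standard-library sqlite3 module over a small stored table, whereas a streaming SQL engine would evaluate equivalent queries continuously as rows arrive. The `trades` table, its columns, the 60-second window width and the sample data are hypothetical, and the sliding query needs an SQLite build with window-function RANGE support (3.28 or later).

```python
import sqlite3

# Hypothetical sample events: (timestamp in seconds, ticker, price).
EVENTS = [
    (0, "ABC", 10.0),
    (20, "ABC", 12.0),
    (50, "ABC", 14.0),
    (70, "ABC", 16.0),
    (130, "ABC", 20.0),
]


def run_window_queries(events):
    """Return (tumbling, sliding) average prices over 60-second windows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (ts INTEGER, ticker TEXT, price REAL)")
    conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", events)

    # Tumbling window: GROUP BY a fixed 60-second bucket, so each row
    # belongs to exactly one window and each window yields one result.
    tumbling = conn.execute(
        """
        SELECT ticker, (ts / 60) * 60 AS window_start, AVG(price) AS avg_price
        FROM trades
        GROUP BY ticker, (ts / 60) * 60
        ORDER BY window_start
        """
    ).fetchall()

    # Sliding window: OVER with PARTITION BY and ORDER BY yields one result
    # per input row, averaging that row and the preceding 59 seconds.
    sliding = conn.execute(
        """
        SELECT ts, ticker,
               AVG(price) OVER (
                   PARTITION BY ticker
                   ORDER BY ts
                   RANGE BETWEEN 59 PRECEDING AND CURRENT ROW
               ) AS avg_price
        FROM trades
        ORDER BY ts
        """
    ).fetchall()

    conn.close()
    return tumbling, sliding


if __name__ == "__main__":
    tumbling, sliding = run_window_queries(EVENTS)
    print("tumbling:", tumbling)
    print("sliding: ", sliding)
```

With the sample data, the tumbling query returns three rows (one per 60-second bucket), while the sliding query returns five rows (one per input event), which is the practical difference between the two window styles.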