Streaming SQL for Hadoop

Make the Elephant Fly. Real-time Big Data with SQLstream

‘Real-time’ and ‘Hadoop’ have often been treated as synonymous, yet Hadoop is not as real-time as many had hoped. Hadoop has many strengths, but it was never intended for low-latency, real-time analytics over high-velocity machine data streams. With SQL emerging as the key enabler for the mainstream adoption of Hadoop, executing streaming SQL queries over Hadoop extends the platform out to the edge of the network, making it possible to query unstructured log file, sensor and network machine data sources on the fly and in real time.

Real-time Operational Intelligence on Hadoop

SQLstream accelerates Hadoop to process live, high velocity unstructured data streams, delivering the low latency, streaming operational intelligence demanded by today’s real-time businesses.
SQLstream for Hadoop combines SQLstream’s real-time operational intelligence from high velocity machine data with the power of Hadoop for high volume data storage and on-going analysis. SQLstream for Hadoop enables:

  • Stream persistence – Hadoop HBase as an active archive for streaming data and derived intelligence using the Flume API. SQLstream also performs continuous aggregation to support high velocity streams without data loss.
  • Stream replay – restream the complete history of persisted streams from HBase for ‘fast forwarding’ of time-based and spatial analytics. Various interfaces can be utilized, including Cloudera’s Impala.
  • Streaming data queries – join streaming real-time data with historical streams and intelligence persisted in HBase.
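
To make the idea of continuous aggregation ahead of persistence more concrete, here is a minimal Python sketch. It is not SQLstream code and does not use the Flume or HBase APIs. The function name `tumbling_aggregate`, the event layout and the 60-second window are assumptions made purely for illustration. It collapses a time-ordered stream of readings into one summary row per sensor per window, which is the kind of reduced-volume stream that could then be written to an archive.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str, float]          # (epoch seconds, sensor id, reading)
Summary = Tuple[int, str, int, float]   # (window start, sensor id, count, mean)


def tumbling_aggregate(events: Iterable[Event], width: int) -> Iterator[Summary]:
    """Yield one summary row per sensor per fixed window of `width` seconds.

    Assumes events arrive in timestamp order (hypothetical simplification).
    """
    current = None                      # start time of the window being filled
    counts: Dict[str, int] = {}
    sums: Dict[str, float] = {}

    def flush(start: int):
        # Emit the finished window, then reset the accumulators.
        rows = [(start, s, counts[s], sums[s] / counts[s]) for s in sorted(counts)]
        counts.clear()
        sums.clear()
        return rows

    for ts, sensor, value in events:
        start = ts - ts % width
        if current is not None and start != current:
            yield from flush(current)   # the previous window is complete
        current = start
        counts[sensor] = counts.get(sensor, 0) + 1
        sums[sensor] = sums.get(sensor, 0.0) + value

    if current is not None:
        yield from flush(current)       # emit the final, partially filled window


sample = [(0, "a", 1.0), (10, "a", 3.0), (20, "b", 5.0), (60, "a", 7.0)]
summaries = list(tumbling_aggregate(sample, width=60))
```

Because the function is a generator, rows are produced as soon as a window closes rather than after the whole input has been read, which mirrors how a continuous query behaves.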

From SQL to NoSQL to Streaming SQL

The first phase of Hadoop and Big Data saw the emergence of NoSQL data storage platforms, looking to overcome the rigidity of normalized schemas. However, as the technology hits mainstream industry, the need for simpler, high performance and reliable queries is driving a resurgence in SQL as the de facto language for Big Data processing (for example, Cloudera Impala and Google BigQuery). What is now apparent is that SQL is the ideal language for processing data streams using real-time, window-based queries. Normalization and rigid schemas are a non-issue for a streaming data platform – there are no tables, and no data gets stored!

What is Streaming SQL?

SQL was developed to process stored data in a traditional RDBMS. It has a massive existing skills base, proven scalability and sophisticated dynamic query optimization. It also functions equally well, if not better, as a real-time stream computing query language. SQLstream’s streaming SQL queries are standards compliant with ANSI SQL:2008, and we test our SQL queries for standards compliance against the leading RDBMS SQL platforms. There are, however, two differences. First, SQLstream’s core s-Server stream computing platform does not persist any data before processing (Hadoop HBase is the default storage platform for stream persistence, although any data storage platform can be supported). Second, streaming SQL queries execute continuously, processing new data as they are created. So why SQL as a stream computing language?

  • Proven scalability with sophisticated query optimization.
  • Rapid development – a few SQL rules have immense power.
  • SQL skills are readily available in the marketplace worldwide.
  • Supports direct migration of SQL applications to and from existing databases and data warehouses.

A Streaming SQL Example

The following query is a basic example of a streaming SQL query. The query finds orders from New York that ship within one hour. Unlike a traditional static SQL query, this query executes continuously, processing new data as they arrive across all streams in the join, and pushing out results as the query condition is met. The STREAM keyword is used to maintain standards compatibility: without it, the query would return a table rather than a stream of results that continues ad infinitum.
[Figure: SQLstream streaming SQL query]
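
As a rough illustration of the semantics described above, the following Python sketch simulates a continuous join between an order stream and a shipment stream. It is not SQLstream syntax. The function name `fast_new_york_orders`, the tuple layouts and the single merged, time-ordered input are all assumptions made for the example. A result is pushed out the moment a matching shipment arrives within the one-hour window.

```python
from typing import Dict, Iterable, Iterator, Tuple

HOUR = 3600  # window length in seconds

# Hypothetical event layouts, merged into one time-ordered stream:
#   ("order", timestamp, order_id, city)
#   ("shipment", timestamp, order_id)


def fast_new_york_orders(events: Iterable[tuple],
                         window: int = HOUR) -> Iterator[Tuple[int, int, int]]:
    """Yield (order_id, ordered_at, shipped_at) for New York orders
    that ship within `window` seconds of being placed."""
    pending: Dict[int, int] = {}  # order id -> order timestamp

    for event in events:
        kind, ts, order_id = event[0], event[1], event[2]

        # Drop orders that can no longer match; this keeps state bounded.
        expired = [oid for oid, placed in pending.items() if ts - placed > window]
        for oid in expired:
            del pending[oid]

        if kind == "order":
            if event[3] == "New York":
                pending[order_id] = ts
        elif kind == "shipment":
            ordered_at = pending.pop(order_id, None)
            if ordered_at is not None:
                yield (order_id, ordered_at, ts)  # emitted as soon as it matches


sample = [
    ("order", 0, 1, "New York"),
    ("order", 100, 2, "Boston"),
    ("order", 200, 3, "New York"),
    ("shipment", 1800, 1),   # 30 minutes after order 1: matches
    ("shipment", 2000, 2),   # Boston order: filtered out
    ("shipment", 4000, 3),   # more than an hour after order 3: no match
]
matches = list(fast_new_york_orders(sample))
```

The only state held is the set of New York orders from the last hour, which is why a windowed join can run indefinitely over an unbounded stream.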
Streaming SQL supports all standard SQL operations for data streams, including:

  • Stream Select, Insert and Update
  • Stream Join
  • Streaming Partition By and Group By
  • Full set of arithmetic, string, logical, date and timestamp operators
  • Support for User Defined Functions (UDXes)

Streaming SQL queries over Hadoop

SQLstream s-Server, our core streaming computing platform, operates both as a streaming Big Data engine and as a streaming SQL language extension for Hadoop HBase. In Hadoop mode, Hadoop HBase is utilized as the default platform for stream persistence. Data can be streamed directly into Hadoop HBase in real-time, including the raw machine data as it is collected from the log files, applications and sensors, filtered and enhanced versions of the same streams, and any pre-aggregated and analytical intelligence information. SQLstream’s streaming SQL language support for Hadoop offers:

  • Real-time operational intelligence on Hadoop without low-level coding
  • Stream persistence for all raw machine data and derived intelligence information
  • SQLstream Connector for Hadoop HBase – maintain and utilize your Big Data storage platforms in real-time
  • Streaming integration between Big Data storage platforms.
  • Replay persisted streams for time-based and geospatial analysis of existing stored data.

A key advantage with SQLstream is the ability to extract and replay processed data from Big Data storage platforms and join this information with the incoming, live data streams. Operational intelligence results are enhanced by comparing real-time data against known trends, which eliminates false alarms and enables longer-term comparisons. The extraction and data processing in SQLstream uses standards-based SQL queries, enabling powerful real-time queries to be deployed over both streaming and stored data.
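
One way to picture how comparing live data with known trends can suppress false alarms is the Python sketch below. It is an illustration under stated assumptions, not SQLstream's method. The helper names `build_baseline` and `alerts`, the record layouts and the three-standard-deviation threshold are all hypothetical. Replayed history is reduced to a per-sensor baseline, and a live reading raises an alert only when it deviates strongly from that baseline.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable, Iterator, List, Tuple


def build_baseline(history: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Reduce replayed (sensor, value) history to {sensor: (mean, std dev)}."""
    grouped: Dict[str, List[float]] = {}
    for sensor, value in history:
        grouped.setdefault(sensor, []).append(value)
    return {s: (mean(v), pstdev(v)) for s, v in grouped.items()}


def alerts(live: Iterable[Tuple[int, str, float]],
           baseline: Dict[str, Tuple[float, float]],
           k: float = 3.0) -> Iterator[Tuple[int, str, float]]:
    """Yield live (timestamp, sensor, value) readings that lie more than
    k standard deviations from the sensor's historical mean."""
    for ts, sensor, value in live:
        if sensor not in baseline:
            continue  # no known trend to compare against, so stay silent
        mu, sigma = baseline[sensor]
        if abs(value - mu) > k * sigma:
            yield (ts, sensor, value)


history = [("a", 10.0), ("a", 12.0), ("a", 14.0), ("b", 100.0), ("b", 100.0)]
live = [(1, "a", 13.0), (2, "a", 20.0), (3, "b", 100.0), (4, "c", 5.0)]
raised = list(alerts(live, build_baseline(history)))
```

A fixed threshold would need tuning per sensor. Deriving it from each sensor's own history lets normal variation pass quietly while still flagging genuine outliers.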

 

Contact us today.

Please contact us to understand more about our Big Data solutions, or about SQLstream's products and capabilities.

+1 877-571-5775
