Vice President Marketing
June 25, 2012

We’ve been exhibiting at Structure 2012 in San Francisco, where our CEO Damian Black was speaking on dataflow architectures for massively scalable real-time Big Data computing. In fact, this was a milestone for us as Damian was on the very first Big Data panel at the first Structure event in 2008.

Dataflow is a technique for parallel computing that emerged from research in the 1970s. It’s based on graph-based execution models in which data flows along the arcs of a graph and is processed at the nodes. It was decades ahead of its time: hardware was expensive, and there was little mainstream demand for massively parallel, low-latency computing architectures. However, dataflow as an architecture has now found its place and time, with the emergence of Big Data volumes, real-time low-latency requirements, commodity hardware and low-cost storage. Dataflow is driving the architectures for today’s real-time Big Data solutions.
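To make the graph model concrete, here is a minimal sketch in Python of data flowing along arcs and being processed at nodes. It illustrates the general idea only and is not how SQLstream s-Server is implemented; the `DataflowGraph` class and the node names (`parse`, `keep_even`, `square`) are hypothetical.

```python
from collections import defaultdict


class DataflowGraph:
    """Toy dataflow graph: nodes are functions, arcs carry items between them.

    Hypothetical illustration only; not a SQLstream API.
    """

    def __init__(self):
        self.nodes = {}                 # node name -> processing function
        self.arcs = defaultdict(list)   # node name -> downstream node names

    def add_node(self, name, fn):
        # fn takes one item and returns a list of zero or more output items
        self.nodes[name] = fn

    def connect(self, src, dst):
        self.arcs[src].append(dst)

    def push(self, name, item):
        """Process one item at a node and forward each result downstream.

        Returns the (node, item) pairs emitted by nodes with no outgoing arcs.
        """
        emitted = []
        for result in self.nodes[name](item):
            downstream = self.arcs[name]
            if not downstream:
                emitted.append((name, result))
            for dst in downstream:
                emitted.extend(self.push(dst, result))
        return emitted


graph = DataflowGraph()
graph.add_node("parse", lambda line: [int(line)])
graph.add_node("keep_even", lambda n: [n] if n % 2 == 0 else [])
graph.add_node("square", lambda n: [n * n])
graph.connect("parse", "keep_even")
graph.connect("keep_even", "square")

# Each input is pushed through the graph as soon as it arrives.
results = [out for line in ["1", "2", "3", "4"] for out in graph.push("parse", line)]
```

Because each node only depends on the items arriving on its input arcs, independent nodes could in principle run on different cores or servers, which is what makes the model attractive for parallel execution.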

Structure 2012 - Dataflow comes of age

Click to view Structure 2012 presentation video

SQLstream adopted the principles of dataflow as the basis of our architecture for SQLstream s-Server. Our adapters turn any data source into a live stream of data tuples, which are combined, aggregated and analyzed by the SQLstream s-Server platform. SQLstream has added one essential feature to dataflow: the use of SQL as a dataflow management language. SQL has long been the language of choice for relational database management systems, and in that context it is getting bad press in light of new structures for Big Data storage and NoSQL queries. However, SQL is declarative (so applications can be built easily, quickly and cheaply) and is a natural, powerful paradigm for processing streaming dataflows. The benefit is extremely low latency, with the ability to process massive volumes of live data over an unlimited number of servers, which is exactly what real-time Big Data requires. In fact, this is the only architecture capable of processing real-time Big Data streams. With real-time requirements now in the range of 20 to 100 million events per second, power, scalability and low latency are key.

Diagram 1: Dataflow architecture for real-time streaming Big Data computing

The SQLstream s-Server architecture concept is illustrated in Diagram 1. As in any dataflow architecture, processing happens at the nodes, and here each node is a streaming SQL statement: a continuous SQL query that processes arriving data over a moving time window. Time windows can range from 1 millisecond for ultra-low-latency requirements to months or even years where comparison against long-term moving averages is required (Bollinger bands, for example). Why is this important? Well, it’s the only approach for low-latency, real-time solutions, because information flows out of the system as soon as input data arrives. In other words, the high latency of batch-based approaches such as Hadoop Map-Reduce is removed completely.
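The moving time window idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SQLstream code: the `SlidingWindow` class is hypothetical, timestamps are plain seconds, and the bands are computed as the window mean plus or minus `k` population standard deviations, which is one common definition of Bollinger bands.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingWindow:
    """Continuous query over a moving time window (hypothetical sketch).

    Every arriving event immediately produces an updated result, rather
    than waiting for a batch to complete.
    """

    def __init__(self, width_seconds, k=2.0):
        self.width = width_seconds
        self.k = k              # band width in standard deviations (assumed default)
        self.rows = deque()     # (timestamp, value) pairs, oldest first

    def on_event(self, timestamp, value):
        self.rows.append((timestamp, value))
        # Evict rows that have fallen out of the window ending at this event.
        while self.rows[0][0] <= timestamp - self.width:
            self.rows.popleft()
        values = [v for _, v in self.rows]
        middle = mean(values)
        spread = self.k * pstdev(values)
        return middle - spread, middle, middle + spread


window = SlidingWindow(width_seconds=10)
first = window.on_event(0, 10.0)    # window holds [10.0]
second = window.on_event(5, 20.0)   # window holds [10.0, 20.0]
third = window.on_event(12, 30.0)   # the row at t=0 has expired
```

The point of the sketch is that output is emitted per arriving event; the window only bounds how much history each result takes into account.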

Mozilla Glow: Real-time download monitor with SQLstream and HBase

Damian presented a simple example of SQLstream and parallel dataflow in action. Mozilla’s Glow application was a continuously updating download counter for the Firefox 4 browser at the time of its release. The application used SQLstream s-Server to collect live download statistics from all the download servers worldwide. Download records were processed and aggregated in real time and displayed on the Glow visualization map, illustrating exactly how many copies of the browser had been retrieved. SQLstream s-Server also provided a continuous ETL operation into Apache HBase, storing aggregated and filtered records for further in-depth analysis. Click here to watch the application in action.

Finally, and in contrast to Structure, we also attended a Gartner session last week with Merv Adrian and Svetlana Sicular, which sought to bring some sense of perspective to Big Data. This was really a reality check on the current maturity levels of the Hadoop Big Data platforms and the effort required to deploy them. Wider adoption across industry in general will require significantly more mature products and applications, particularly around the OPEX costs of deployment, security concerns and the ability to deliver business intelligence to all consumers in a large organization. The recommendation was to use an integrator such as Cloudera or Hortonworks. Mainstream organizations are looking at the Hadoop / Big Data approach, but many do not currently see either a use case or a reason for adoption. It was interesting to hear a perspective that didn’t need to be buzzword compliant, and that presented a positive yet realistic view of wider adoption.

Posted under Big Data · Streaming SQL · Uncategorized

Vice President Marketing
June 7, 2012

We’re at Sensors Expo this week, showcasing in the Big Data & Analytics Pavilion. This is the first year the event has included a specific area for real-time Big Data solutions for sensor networks.

Real-time control in a Big Data World

SQLstream CEO Damian Black presented on Real-time Control in a Big Data World. The presentation focused on the growth of sensor data and the emergence of the “Sensor Internet”, plus the applications required to collect and analyze streaming sensor data and to drive real-time actions and updates. In particular, it addressed the emerging real-time Big Data challenges in this area, driven by wireless and GPS technologies, M2M applications and V2V/V2I.

Click here to view Real-time Big Data integration and analytics for sensor networks

Real-time streaming data integration for Big Data

It’s clear the primary challenge is not managing the data volume per se, or even delivering real-time operational intelligence. Rather, it’s the more fundamental issue of real-time streaming data integration. How can such huge volumes of data from many different sources and locations be integrated into the operational platforms, and how can the problem of multiple operational silos be overcome to provide an integrated real-time control platform?

Interestingly, these are the exact same issues SQLstream addresses for the Big Data and Hadoop world in general – getting data in, getting data out, connecting existing data stores in real-time, and delivering real-time in-memory analytics on the data as it streams past:

  • Real-time streaming data integration of any data source and between existing storage platforms and operational systems
  • Real-time streaming monitoring and analytics on the arriving and streaming data
  • Scalability through parallel, distributed execution of processing pipelines

The importance of geospatial analytics

Geospatial analytics is a key requirement in the sensor data market. Big Data analytics in general is about one-dimensional problems, usually the correlation of similar events or the correlation of events over time. The geospatial dimension is the key difference between Big Data platforms for the “Sensor Internet” and the wider IT / machine data applications. Fortunately, this has been a feature of SQLstream for some time and is central to many of our customer deployments. Examples include real-time traffic analytics from GPS data and real-time seismic monitoring.

Posted under Uncategorized

Vice President Marketing
June 5, 2012

Glue Conference 2012, held in Denver, CO at the end of May, was a great conference: well attended, with knowledgeable participants. It is also the only conference I know of that looks at gluing cloud and mobile applications together with a developer focus.

There was the usual wave of NoSQL, cloud storage, cloud platform and Hadoop presentations, as you’d expect, but there were some interesting keynotes as well. Ray O’Brien, CTO for IT at NASA, talked about the evolution of Nebula and OpenStack at NASA, and James Governor from Redmonk talked about the evolution of historical analytics.

From our perspective, the strength of the show was in making physical rather than logical connections. We saw both partnership opportunities and potential customer interest in building real-time Big Data applications, as well as interest in how SQL has been repurposed as an API for streaming Big Data, moving it forward significantly from its roots as a static data management language.

Real-time, streaming Big Data

Relational Streaming for real-time Big Data scaling

Our CEO, Damian Black, presented on real-time streaming Big Data, both as a real-time alternative to Hadoop, and also as a complement to add real-time responses and streaming integration to existing Hadoop installations. One question we were asked several times was why SQL? A good question. This isn’t a religious debate about the language by any means, and if we had opted to build a Big Data batch storage and analytics platform (e.g. like Hadoop), we would have gone a different route.

However, when it comes to processing streaming tuples in real time, a standard SQL approach has two big advantages over all others. First, with the extension of the SQL WINDOW operator to process streaming data over fixed time windows, both structured and unstructured data can be processed painlessly, without having to define a static schema and without the need for any coding whatsoever. In effect, SQLstream processes streams of arriving tuples over time windows and pushes out the results to other systems. This is similar, in concept at least, to Hadoop, although Hadoop is purely batch-based, processing static files and pipelining sets of tuples through low-level Map-Reduce functions.

Relational Streaming for Real-time Big Data

Click on the title slide to view the real-time Big Data presentation

However, the second benefit is equally important. Streaming SQL queries include standard operators such as GROUP BY and PARTITION. These provide the best clues possible to a query planner capable of automating the dynamic scaling of streaming pipelines over vast numbers of servers. This gives a reliable and controllable mechanism for Big Data scalability without the need to hardcode server allocation hints.
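Why a grouping key is such a useful clue can be shown with a small Python sketch. It assumes a simple hash-partitioning scheme, which is our illustration and not a description of the SQLstream query planner; the function names and the country-code events are hypothetical.

```python
import zlib
from collections import Counter


def worker_for(key, num_workers):
    """Route a grouping key to a worker using a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partitioned_count(events, num_workers):
    """Equivalent of COUNT(*) ... GROUP BY key, spread over several workers.

    Each worker keeps counts only for the keys routed to it, so workers
    never need to coordinate with each other.
    """
    workers = [Counter() for _ in range(num_workers)]
    for key in events:
        workers[worker_for(key, num_workers)][key] += 1
    return workers


events = ["de", "us", "de", "fr", "us", "de"]
workers = partitioned_count(events, num_workers=3)

# Because every key lives on exactly one worker, the global result is
# simply the union of the per-worker results.
merged = Counter()
for partial in workers:
    merged.update(partial)
```

Since all rows for a given key land on the same worker, adding servers only changes the routing function, not the query itself.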

Real-time at GlueCON

The strength of the real-time track at GlueCON was encouraging. It was interesting, though, that the term ‘real-time’ is now about as overused as ‘Big Data’, and about as poorly understood. For SQLstream, it’s the streaming integration of any and all data sources with in-memory analytics, processing streams at millions of events per second. For some other vendors, it appears real-time drops off at significantly lower rates and numbers of connections!

Next stop, GigaOM Structure in San Francisco

Our next stop is GigaOM Structure in San Francisco, June 20 / 21 at the Moscone Center. Visit us there if you’re attending.

Posted under Big Data · Real-time · Streaming SQL · Uncategorized