Blog


Test
May 10, 2012

This week I’m attending an interesting conference at UC Berkeley called the “Berkeley conference on Streaming Data”.  The organizers are primarily astronomers and statisticians, but the talks discuss issues and solutions to streaming data problems across a wide selection of scientific areas and engineering applications.  Real-time streaming Big Data applications presented included oceanography biology genetics, reading handwriting, astrophysics, particle physics, recommendation engines for social media, and inevitably, real-time fraud detection from live data feeds.

I presented on a deployment of SQLstream as a Dynamically Scalable Cloud Platform for the Real-Time Detection of Seismic Events. Based on work with UCSD seismologists, SQLstream has been deployed to detect significant events in data collected from a large grid of seismic sensors. A large-scale data infrastructure (the OOI/CI) provides raw signal data over an AMQP message bus.

Plot of Seismic Events

SQLstream monitors live seismic data feeds in real-time, applying heuristic algorithms that look for patterns indicating earthquakes. The live system scales dynamically across multiple servers in a cloud environment based on the current demand. You can view the presentation here.  I also blogged previously on the application here.

In conclusion, I have two main observations from the conference so far (it continues until Friday). The first is that the majority of fields in science and technology appear to have a Big Data and often a real-time Big Data problem.  Secondly, the extent of the innovation and computer science resources dedicated to solving these problems.  In particular for this conference, developing algorithms for data analysis and machine learning (that is automatic pattern recognition) that work on streams of flowing data.  It’s clear that traditional data management and even Big Data batch-based methods don’t work when you need continuous results from dynamic data. And the amount of data is huge.


Test
March 13, 2012

SQL is a declarative language – a SQL query is a specification for the result, it’s neither a recipe nor a program to produce the results. A traditional relational database query returns a set of rows, the ResultSet. A streaming SQL query in SQLstream returns a stream of rows. That is, the ResultSet may never end. In a traditional relational database query, all the rows are fetched, and the ResultSet scans them. With a relational streaming query, the result rows do not exist as yet – as time goes by they come into existence as arriving data are processed.

However, just like any relational database, the SQLstream stream computing Server has two main components:

  1. The query engine or planner, calculates the most efficient plan to produce the requested results – this is the query plan.
  2. The data engine or kernel, executes the query plan to produce the results. The scheduler controls the execution process.

Streaming dataflow graphs

Executing a query means computing the results from the inputs. For a streaming query that means processing the rows in the input streams as they arrive. The execution is organized as a dataflow graph, that is, a mathematical directed graph of nodes and arrows, where the nodes represent elementary operations on data, and the edges into and out of a node represent the input and output data streams. In effect, an assembly line that produces results, where the nodes are the machines or stages on the line.

A query plan defines one of these data flow graphs. The executor runs the data through the graph: it is responsible for executing the nodes, where each node consumes rows arriving on input edges and produces rows on the output edges. Of course each output edge is often the input of a downstream node.

Multiple, connected query plans

In a traditional static database, each query plan is independent and transitory, and operates against persistent tables and indexes. In a relational streaming platform, the query plans last forever and are interconnected. Although streams are used just like tables in SQL, they are not persistent, in fact they have no contents at all, and can be as rendezvous points that accept input rows and pass them on to their output consumers.

Now if the data flow graph were a physical system — say, a collection of transparent plastic straws with colored water flowing through them — then all the processing would be happening simultaneously. However, for the software abstraction of the streaming straw pipelines, it’s not practical or necessary to run all the nodes at the same time. It is the scheduler that manages this network of interconnected query plans, and when and how to execute each node.

In a traditional, static database, the result of a query is a set of rows that are computable all at once. The executor can give good performance by running one node at a time, pushing batches of rows through the graph. A streaming database is different. When the inputs are streams of recent events, arriving in real time, it’s important to produce the outputs fast enough so that the result rows are timely.

The execution works by pushing outputs, not by pulling inputs, and it means executing several nodes at the same time, whenever possible. This requires a finer management of the execution objects and the ability to schedule parallel execution on the nodes.

Parallel scheduling of stream execution

In SQLstream, multiple, interconnected query plans are being executed at the same time. Together they constitute a large dataflow graph in which each node is a mini data processing machine that performs a simple operation on its input data, and passes its output to the next node.

The scheduler is responsible for managing the interconnected dataflow graph. It keeps track of the status of each node: at any moment, some are running, some are ready to run, some are waiting for more input, some are waiting for their output to be consumed. Each node is allowed to run for a quota of time, and where possible nodes are selected to execute in parallel as separate threads.

The SQLstream scheduler may not want to be fair: some branches in the graph (some streams) may be more important, some may need high throughput, some may need lower latency. The application designer decides, and the SQLstream scheduler delivers.

Next time …

This is the first is a series of blogs discussed both the principles and practical examples of parallel stream execution. The next blog in the series will look at some real world examples, and how parallel execution is essential to deliver both high throughput and low latency requirements.


Test
July 20, 2011

SQLstream is helping to predict earthquakes across the world in real time. The system has been developed by a consortium of universities and government agencies, with funding from NSF (National Science Foundation), to provide an infrastructure of networked tools for research in ocean science – constructing an internet-based system to collect and share data.

Ocean Research Program Overview

This is a large system with 16,000 land and sea-based sensors, each sending several channels of seismic event data at a rate of 40 data points per second. The application executes in an Amazon EC2 Cloud on a cluster of SQLstream servers, connected by an AMQP message bus. SQLstream’s AMQP adapter (built with RabbitMQ) enables the streaming SQL application to view the AMQP bus as a domain of input and output data streams. The initial prototype was upgraded to a full-scale system running on a cluster of servers without any changes to the streaming SQL application.

SQLstream’s contribution to the project, an application that processes seismographic data in real-time, demonstrates:

  1. An operational deployment of streaming SQL for scientific calculations in real-time
  2. Real-time distributed processing in an Amazon EC2 cloud using SQLstream and AMQP
  3. Rapid development and rollout of real-time data applications using standards-based streaming SQL.

Seismic Monitoring
The sensor network contains about 16,000 seismic sensors, organized into grids, covering large parts of the North American continent and the adjacent oceans (see illustration below). Each sensor measures the motion of the ground under it in three dimensions, and transmits its data as several digitized channels, typically sampled 40 times a second.

Seismic Sensor Map

While the rate of each signal channel is modest (since seismic waves are low-frequency),
this adds up to a large amount of data to process in real time. Moreover, the rules for detecting a seismic event are heuristics that apply to a time interval of several minutes: so the application has to calculate some quantities from the raw data and to store these calculated values over a time window.

But to detect an earthquake reliably, it’s better to monitor all the sensors at once, looking for a disturbance in the signals that first appear in one place, and then appear in nearby places: a disturbance signal that propagates and changes shape in a way consistent with the physics of a seismic wave.

Real-time sensor network management in an EC2 Cloud
Monitoring tens of thousands of signal channels arriving at 40 sample points per second is a complicated problem, but it can be made simpler by breaking it into stages. We’re interested in earthquakes, which are infrequent, so SQLstream first reduces the amount of data by scanning each channel for patterns that suggest the beginning, the peak or the end of a quake: in other words reduce the dense signal to a sequence of interesting events. Then we can look for events detected on other channels that could be due to the same quake propagating in physical space and time.

In the first phase, we built a real-time seismic event detector in streaming SQL.
We translated a scientific algorithm into streaming SQL, and connected to the scientific sensor data infrastructure – using our AMQP adapter. This involved less than 100 lines of streaming SQL.

In the second phase, the prototype was scaled up to a full-size system, dealing with 16,000 sensor channels, running on multiple SQLstream server nodes created automatically in an Amazon Elastic Cloud. This expansion required no changes to the streaming SQL application developed for the initial prototype – simply running the same streaming SQL application inside an elastic container/manager.

Real-time Event Detection with Streaming SQL
The sensor data processing pipeline has five functional stages:

  1. Reading Messages – over the AMQP adapter. A configuration parameter specifies which “topics” SQLstream subscribes to, that is, which set of sensor channels it receives.
  2. Unpack Data – a user defined transform (UDX) unpacks data channel messages into individual data points.
  3. Signal Processing – extract higher order information from the raw data to identify seismic events given background noise and non-seismic ‘bumps’. An example function would be to calculate the ratios of multiple rolling averages of the signal value (x) over different time windows.

  4. Event Detection – The next stage applies heuristic rules to the processed data streams – the output is much sparser: a stream of significant events, each indicating a possible start/peak/stop of a seismic wave on a particular channel. If events of the correct type occur within a correct interval of each other (as shown in the Signal Plot Diagram), they are accepted as significant.
  5. Writing Messages – output (publish) significant events over the AMQP adapter.

Automatic EC2 scaling for Big Data sensor volume
The pipeline described has been proven to scale well to handle even greater data volume:

  1. Channels are processed independently, so scale as O(N).
  2. Additional channels are processed by adding more pipelines
  3. Each pipeline starts and ends with AMQP messages – the SQLstream / AMQP interoperability has been shown to scale well.
  4. To add a pipeline, we simply add another Elastic Compute server, running the same pipeline streaming SQL, but configured to subscribe to its own set of sensor channels.

This is the first part of a series of blogs describing the seimic monitoring solution. The next blogs will focus on the streaming SQL used in the application, and the SQLstream / AMQP architecture for Big Data scalability.