Blog


Vice President Marketing
October 19, 2012

GigaOM took their successful format for the Structure events to Europe this week for the first time and SQLstream’s CEO, Damian Black, was one of the key speakers. In a session with GigaOM’s leading Big Data writer, Derrick Harris, Damian discussed how real-time Big Data can only be achieved using continuous streaming integration, and how standard SQL is the perfect vehicle as a powerful language for the massively parallel distributed processing of live data streams.

As always, the GigaOM events were extremely well executed,and drew many of the main thought leaders and business leaders in the industry. For example, Ame Awadakkah CTO of Cloudera and Werner Vogels CTO of Amazon.

Real-time Distributed Big Data Processing

Damian’s talk was entitled ‘The Rebirth of Dataflow Computing: Real-time Distributed Big Data Processing’, and discussed how dataflow, an architecture for massively parallel distributed computing, is ideal for streaming Big Data integration and analysis. And SQLstream’s dataflow architecture harnesses the power of parallel dataflow using standard SQL as the data management language.

SQL is ideally suited to driving massively scalable applications based on a parallel, distributed processing architecture. In turn, this enables our customers to respond to each and every new record or piece of data as they arrive, in real time. So SQLstream is super low latency, unlike traditional data warehousing and Hadoop storage platforms, where data must be stored first and then queried. Even if you query faster, querying faster is not real-time.

Posted under Big Data · Real-time · Streaming SQL

Vice President Marketing
July 24, 2012

SQLstream has joined the Google Cloud Platform Partner Program as a Technology Partner with the release of our Continuous ETL connector for real-time Big Data integration with Google BigQuery. Continuous ETL solves the real-time performance issue for Big Data storage platforms. Although very capable of processing large datasets once the data has been stored, Big Data storage platforms are not designed to process real-time streaming data.

Continuous ETL for Google BigQuery

SQLstream’s Continuous ETL connectors enable SQLstream to integrate and analyze vast volumes of live, real-time Big Data, and to update the Big Data storage platforms immediately as input data arrives. With existing ETL solutions, the data warehouse and Big Data storage platforms are updated in batch mode and are therefore always out of date. Continuous ETL updates continuously and in real-time.

Real-time, integrated Big Data solutions

Grupo IntechThe continuous ETL integration was carried out by Grupo Intech as an integral component of a SQLstream and Google BigQuery solution for detecting road network traffic congestion in real-time using GPS data.

GPS records are collected from several specialist data providers. Each GPS record contains the position, direction and speed of a vehicle. SQLstream collects GPS records from all the providers in real-time and turns the live GPS data feeds into real-time traffic flow information, and by applying a variety of congestion detection algorithms, is able to generate a map of the current traffic congestion positions, plus additional information as to the extent or severity of the problem.

SQLstream calculates real-time traffic analytics and displays the results in real-time on a map-based display. Roads are color-coded based on the average traffic speed relative to the posted speed limits for each road segment. Colored push-pins appear on the highways to provide further information on the potential problem, such as the average speed for every minute over the previous 15 minutes.

Continuous ETL and Google BigQuery

So how does Big Query help? GPS data is of variable quality, and also, as the data is available from many different providers, the network coverage for each provider can differ significantly.

Google Big Query is used to store the real-time traffic information produced by SQLstream, and to generate a confidence index on the coverage and quality, and therefore the reliability, of the incoming GPS records. Effectively, Google Big Query is helping Grupo Intech answer the question – do we have sufficient valid data for the real-time congestion information to be reliable.

SQLstream’s Connector for Google Big Query is used to deliver real-time analyzed data using a Continuous ETL operation. With continuous ETL, streaming data is continuously acquired, cleaned, conditioned, transformed and then periodically appended to the BigQuery table, for example, every minute or every 500,000 events. Therefore the data in Big Query is always up to date and accurate.

Here we see the Big Query table, which holds a minimum of 3 months of enhanced and interpolated data for ad-hoc data mining. The period can be extended if needed for improved accuracy.

Continuous ETL – more than real-time

SQLstream’s analysis of the road network is based on individual 10m road segments. Typically GPS records are not available for every vehicle for each and every road segment, so SQLstream interpolates missing road segment data in real-time, and continuously appends those results as well into the Google Big Query table – storing a complete real-time map of traffic flow for the entire road network at a level of granularity of every 10 meters. Therefore unlike traditional ETL tools that tend to offer only a basic mapping function, SQLstream delivers significantly more information into Google Big Query than is available in the source data.

How does it work

SQLstream allows the user to zoom in and center the geographic map to an area of interest. This action defines a bounding box and the coordinates of the bounding box are used in this first Big Table query to extract the number of 10 meter road segments within the bounding box. The action of defining the bounding box in SQlstream  launches a query similar to the following example:

select count(*) from road_elements
where
  "reLatitude" between 7.7109920000000001 
                   and 9.8525100000000005 AND
  "reLongitude" between -65.754776000000007 
                    and -62.958756000000001.

This query retunes the number of road segments within the bounding box (in this example, there are 191,866 road segments) which is then plugged into the query executed by BigQuery in order to calculate the percentage of road elements for each GPS data provider actually used for calculating the road network coverage within the bounding box.

select vendorID, float(integer(count(*) / 191866 * 1000)/10) as Confidence
from (select vendorID, reID from sample.july2nd
where 
 float(reLatitude) between 7.7109920000000001 
                       and 9.8525100000000005 AND
 float(reLongitude) between -65.754776000000007 
                        and -62.958756000000001
group by vendorID, reID)
group by vendorID order by Confidence desc

The query results are returned to SQLstream which displays the answer to the user. Big Query is ideal for these types of geospatial coverage metrics. There is a high volume of meaningful data stored in the Big Query table and the calculation of coverage is available immediately. By choosing a variety of bounding boxes, it is possible to obtain coverage metrics a country-wide basis, on a metropolitan area basis, and on for a much smaller problem area.

Conclusions

In conclusion, we have demonstrated how continuous ETL is the basic building block for integrating real-time and historical data. Continuous ETL enables offline data storage to be kept up to date directly from real-time applications. And unlike conventional ETL solutions that simply map data between systems, SQLstream’s Continuous ETL connector for Google Big Query delivers aggregated data, analytics as well as augmented and interpolated data values.

 


Vice President Marketing
June 5, 2012

Glue Conference 2012 , Denver CO, at the end of May was a great conference, well attended, knowledgeable participants and is the only conference I know that looks at gluing cloud and mobile applications together with a developer focus.

There was the usual wave of NoSQL, cloud storage, cloud platforms and Hadoop presentations, as you’d expect, but also with some interesting keynotes as well. Ray O’Brien, CTO for IT at NASA. talked about the evolution of Nebula and OpenStack at NASA, and James Governor from Redmonk, talking about the evolution of historical analytics.

From our perspective, the strength of the show was in making physical rather than logical connections. Both partnerships and potential customer interest in building real-time Big Data applications, and how SQL has been repurposed as an API for streaming Big Data, moving it forward significantly from its roots as static data management language.

Real-time, streaming Big Data

Relational Streaming for real-time Big Data scaling

Relational Streaming for real-time Big Data scaling

Our CEO, Damian Black, presented on real-time streaming Big Data, both as a real-time alternative to Hadoop, and also as a complement to add real-time responses and streaming integration to existing Hadoop installations. One question we were asked several times was why SQL? A good question. This isn’t a religious debate about the language by any means, and if we had opted to build a Big Data batch storage and analytics platform (e.g. like Hadoop), we would have gone a different route.

However, when it comes to processing streaming tuples in real-time, a standard SQL approach has two big advantages over all others. First, with the extension of the SQL WINDOW operator to process streaming data over fixed time windows, both structured and unstructured data can be processed painlessly without having (no pun intended) to define a static schema and without the need for any coding whatsoever. In effect, SQLstream processes streams of arriving tuples over time windows and pushes out the results to other systems.  Similar in concept at least to Hadoop, although Hadoop is purely batch-based, processing static files and pipelining sets of tuples through low level Map-Reduce functions.

Relational Streaming for Real-time Big Data

Click on the title slide to view the real-time Big Data presentation

However, the second benefit is equally important. Streaming SQL queries include standard operators such GROUP BY and PARTITION. These provide the best clues possible to a query planner capable of automating the dynamic scaling of streaming pipelines over vast numbers of servers. This gives a reliable and controllable mechanism for Big Data scalability without the need for hardcoding server allocation hints.

Real-time at GlueCON

The strength of the real-time track at GlueCON was encouraging. It was interesting though that the term ‘real-time’ is now about as over used as ‘Big Data’, and about as poorly understood. For SQLstream, it’s the streaming integration of any and all data sources with in-memory analytics, processing streams at millions of events per second. For some other vendors, it appears real-time drops off at significantly lower rates and numbers of connections!

Next stop, GigaOM Structure in San Francisco

Next stop GigaOM Structure in San Francisco, June 20 / 21 at the Moscone Center. Visit us there if you’re attending.

Posted under Big Data · Real-time · Streaming SQL · Uncategorized