Archive for the ‘real-time industry solutions’ Category

Could real-time intelligence be the catalyst for industrial innovation?

Thursday, February 2nd, 2012

Today’s new world economy has manufacturers racing toward opportunities requiring growth through expansion and increased productivity while pricing remains flat. The increase in fuel, energy, raw materials and labor prices are offsetting scientific and technological advances applied to modern factory machinery, processes and the workforce.

Manufacturing automation technology solutions offer manufacturers monitoring and alerting applications improving plant manager oversight and response to quality, consistency and cost issues. While top 100 computer and software companies offer solutions in this space, finding a realistic positive ROI offering is daunting with many requiring huge investments in entire systems overhaul or replacement.

Innovation insights from the chemicals industry

Innovation insights from the chemicals industry, Tom Craren PricewaterhouseCoopers, LLC

More than any other sector, the chemicals industry is investing heavily in innovation to garner a competitive edge. Ninety-two percent of CEOs in this industry believe that innovation will lead to operational efficiencies and competitive advantage, 13 percent more than all CEOs surveyed.” By Tom Craren, PricewaterhouseCoopers, LLC. (See Figure)

Today’s IT systems are exorbitant purchases requiring a long-term commitment and a finite vision of volume and quality. Unfortunately these solution sets quickly become static when margins shrink and volumes must increase to continue operating in the red. Competitive solutions in this economic environment must show nearly immediate returns on investment by increasing output and improving quality. This requires a lighter and more powerful system that has the following traits.

  • Unlimited scalability
  • Seamless integration with current systems.
  • Low cost fast deployment.

Plant managers and engineers should consider a lightweight approach to their efficiency shortfalls rather than the hefty out of the box system overhaul which may give a pretty picture but not the tailored in depth analysis and alerting needed.

Envision a real time layer over existing systems currently in place. A real time data engine that stands alone aggregating unlimited amounts of disparate data, analyzing it “on the fly” without a database and delivering it to any device in any format for real time machine-to-machine and human response.

The real time data engine would also have unlimited scalability creating an ever growing solutions platform using standard database querying language. The flexibility and power of this automation platform allows for continuous upgrading of machinery, flow process and simultaneously integrates dated systems and disparate devices from diverse manufacturers.

Practical benefits of a real time data engine will include the following:

  • Real time big data processing and operational intelligence “on the fly”.
  • Real time data enhancement “on the fly”.
  • Real time historical comparatives and complex predictive analysis.

Real time operational decisions made in time by machines and humans will reduce downtime, improve quality and increase output.

Real-time QoS Monitoring for IP Services

Thursday, January 26th, 2012

QoS and service level monitoring has always presented a challenge for telecommunications companies. With the increase in uptake of IP voice and video services, the vast data volumes generated, and the lack of an end to end view, make monitoring the service experience in real-time increasingly difficult.

Real-time QoS Monitoring

Diagram 1: Real-time QoS Monitoring

In this blog I’m looking at the core building blocks of a real-time IP service monitoring solution by using a much simplified view of a real-time application. Diagram 1 illustrates the basic problem – how to monitor an IP service when the end to end view is only possible by piecing together large volumes of events from many different sources – the core network provider’s network, the home network and the cable modem, and the service providers platforms.

In SQLstream we capture each event stream in real-time. Applications are built as streaming pipelines – unlike a traditional database solution, where event data must first be stored and then processed, SQLstream streams the data through processing views, capturing, combining, filtering, aggregating and applying analytics to the events streams, without having to store the data.  This enables real-time operational intelligence with extremely high volume performance with very low latency.

The first views in the pipeline capture the data streams. A declaration for an external data feed is shown below, the real-time MyEvent stream, where source of events is the external system agent or integration adapter.


   CREATE OR REPLACE STREAM MyEvent (
       "eventName"    VARCHAR(10),
       "eventSeq"     BIGINT,
       "eventVal1"    INTEGER,
       "eventVal2"    SMALLINT,
       "eventVal3"    BIGINT
   ) DESCRIPTION 'source of events'; 

The raw MyEvent data is first filtered, searching for the events of interest. As illustrated in the code example below, these initial views tend to be as simple as possible in order to maximize reuse – the simplest being a SELECT STREAM * FROM WHERE statement. Streams can be combined, grouped or joined in a single view, or a single view provided per stream, or both.


   CREATE OR REPLACE VIEW RawEvents AS
        SELECT STREAM *
        FROM MyEvent
        WHERE "eventName" = 'RawEvent';

Diagram 2 illustrates the concept of the streaming data pipeline, using a simplified example for exception detection. The SQL view illustrated above is the Stream Capture #1 view in the diagram. The use case is built on a real world example, raising an exception if a number of events of a particular type or value are detected within a specified time window.

Real-time Stream Processing Pipeline

Real-time Stream Processing Pipeline

The second view in the pipeline, Stream Processor #1, is shown below. In this example the view is responsible for the basic processing of the stream, counting the number of events that occur within a time window, in this case 180 seconds.


  CREATE OR REPLACE VIEW CountedEvents AS
   SELECT STREAM
      *,
      COUNT("eventName") OVER win AS "eventCount",
      FIRST_VALUE(RE.ROWTIME) OVER win AS "firstEventTime",
      FIRST_VALUE("eventSeq") OVER win AS "firstEventSeq"
      FROM RawEvents AS RE
   WINDOW win AS (RANGE INTERVAL '180' SECOND(3) PRECEDING);

The final stage in this particular processing pipeline is the detection of the alert.

  CREATE OR REPLACE VIEW FlagTriggerEvents AS
      SELECT STREAM *,
          "eventCount" >= 3 AS "alert"
      FROM CountedEvents;

It would of course be possible to include all processing in a single view. However, maximizing reuse of views is a major consideration when building a stream processing application. The example is to illustrate how a pipeline can be constructed, where each view can have any number of consumers. For example, any number of Rule views can read from the Stream Processor #1 view, and any number of views can read directly from the stream capture view.

The application includes significantly more sophisticated integrations, features and analytics than illustrated here. For example:

  • Multiple rules
  • Recording and forwarding the events responsible for the generation of the alerts
  • Detect escalation
  • Detect clearance events
  • Join with alert history to identify exceptional events that deviate significantly from historical norms

These use cases are important components of a complete solution, and I’ll be providing examples in subsequent blogs, explaining how these have been implemented.

SQLstream to announce new product for real-time Intelligent Transportation at ITS World Congress, Orlando.

Wednesday, October 12th, 2011

Visit SQLstream on Booth #1366

Technology and innovation are central themes of this year’s ITS World Congress.  There’s been much written about the issues of congestion, green transportation schemes and improving personal mobility, not least in this blog.  At SQLstream we’ve been playing our part to help revolutionize the Intelligent Transportation industry.  It’s clear that the concepts of streaming data and real-time analytics are entering the main stream – from low level Big Data toolkits that require a streaming, low latency front end, to the real world of sensor networks and industries such as smart grid and telecommunications.

This is just as true in transportation.  Here we have an industry with vast volumes of sensor data, a need for sophisticated real-time analytics, and platforms capable of driving real-time process automation.  We’ve been working with a number of transportation agencies for some time, and are about to launch a new ‘Insight’ product for intelligent transportation.  Our ‘Insight’ range provides tools and out of the box support for specific industry verticals based on our core Stream-to-Business platform.

Google Earth Display for Road Traffic Congestion

Google Earth Display for Road Traffic Congestion

For Intelligent Transportation this means processing sensor data from GPS and fixed-road sensors, to deliver applications such as real-time Travel Time, live congestion detection and network KPI reporting.

Should you be attending the ITS World Congress, we’d be delighted to see you on our booth (#1366) for a demonstration.

SQLstream, Intelligent Transportation and ITS World Congress

Tuesday, September 13th, 2011

ITS World Congress, 2011. Visit SQLstream, Booth #1366

The 18th World Congress on Intelligent Transport Systems (ITS) is being held in Orlando from October 16th – 20th, 2011. This is the leading event for intelligent transportation solutions, and attracts a large audience of government, technology and industry professionals. The event seeks to demonstrate advances in the application of new technology and smart transportation. Major areas of focus include the reduction of traffic congestion and improvement in  personal mobility.

With 800 million vehicles on the world’s roads today, a number forecast to grow to between 2 and 4 billion by 2050, it is clear that transportation management  systems will need to analyze real-time sensor and GPS data dynamically on a massive scale to reduce congestion and optimize personal mobility. The objective is to achieve a fluid and reliable transportation network, that can respond dynamically to changing loads and conditions, and provide consistent and acceptable travel times.

The performance of a transportation network can be measured based on road usage (number of vehicles), and the travel speed and time from origin to destination.  Today’s traffic management systems rely on historical analysis of data from fixed sensors.  However, roadside and in-road sensor projects are very expensive to install and maintain. As a consequence, only a very limited view of the overall road network is available,  with sensor deployments focusing on primary routes and major intersections only. Also, fixed sensors tend to report traffic flow – at best a secondary measure of the real requirement –  congestion.

Most important however is the lack of real-time, dynamic behaviour from existing traffic management systems.  Flow control, for example at intersections and on freeways, is activated at specific times based on the historical analysis of the fixed sensor data – this helps, but is unable to react to changing patterns of traffic flow and congestion.

One approach to the problem is to introduce the latest wireless GPS sensor technology.  Wireless GPS sensors have two significant advantages:

  1. Immediate and real-time information on vehicle speed and location.
  2. Low cost solutions that can be deployed quickly, with little or no maintenance.
  3. Provides a direct measure of vehicle speed and the ability for real-time and accurate measure of congestion.
  4. Complete network insight – highways and arterial routes – at the granularity of a few meters.
SQLstream ITS Insight

SQLstream ITS Insight (Click to enlarge)

For example, when the  Roads and  Traffic Authority (RTA) for New South Wales in Australia was re-evaluating its approach to intelligent transportation systems, it identified wireless GPS technology as both a significantly cheaper and potentially much superior solution to congestion detection and Travel Time.  The RTA selected SQLstream as the real-time traffic analytics and congestion detection platform based on processing in-vehicle GPS sensor data. The SQLstream solution enabled the RTA to cancel a $20million fixed sensor program,  and to build a real-time traffic management platform based on SQLstream’s ITS Insight.

We will be demonstrating our real-time traffic management capabilities on our stand at ITS World Congress in Orlando.  In addition, our CEO, Damian Black will be participating in a number of related panel sessions on arterial travel time solutions and real-time data management for intelligent transportation.  For those attending ITS World Congress, please visit us for a demo at Booth #1366, or visit our website for more information on SQLstream and real-time transportation management systems.  We look forward to seeing some of you at least some at the show.

Real-time congestion detection with Streaming SQL

Wednesday, August 31st, 2011

I am going to discuss a SQLstream application for monitoring traffic flow in real-time. In this application, vehicles with GPS enabled devices transmit vehicle position along with other vehicle information such as speed and engine state. SQLstream receives this information as a real-time data stream and uses streaming SQL analytics to detect and predict the rapid onset of congestion on the road network in real-time.

Streaming SQL for Congestion Detection
The SQLstream application for congestion detection uses a typical streaming SQL processing pipeline. In this case, data is fed into the SQLstream pipeline using our Log File Adapter. SQLstream adapters provide an interface to sources and targets such as databases, log files, network sockets and mail servers. Adapters are built using SQL/MED specification which is part of ANSI SQL standard. In this application, each log file contains the vehicle positions on the road network for the latest minute.

Streaming SQL Pipeline for Real-time Traffic Congestion Detection (click to enlarge)

The conditioning pipeline performs data cleansing operations such as rejecting poor quality data (records with missing or out-of-bounds columns) followed by mapping of vehicle positions (lat/long pair) to a “road element” of the road network using a UDX to perform geo-spatial lookups in an external road network database.

The diagram and the example SQL below show our implementation of a streaming SQL pipeline for congestion detection. Each vehicle reports its position and speed every minute. Two consecutive vehicle positions are then used to interpolate vehicle speeds for each road element on the vehicle path between reporting positions. The interpolated speed is based on actual distance traveled by the vehicle between two consecutive reports. The interpolated speed is calculated in a User Defined Transform(UDX). The UDX is written in Java. The UDX also associates a confidence factor with each interpolated speed value based on the position of the road element relative to endpoints of the vehicle path.

Streaming Traffic Flow Analytics
As illustrated below, the analytics pipeline calculates 15, 5, 4, 3, 2 & 1 minute moving average speeds for each road element. Each road element is color coded based on the 15-minute moving average speed. The results are streamed to a Google Earth display.

CREATE OR REPLACE VIEW “EstimatedReSpeeds” AS
SELECT STREAM “RE”, “reID”, “Carriageway”, “rePrescribed”, “reSpeedLimit”,
++SUM(“reVehicles”) OVER “last1″ AS “reVehiclesLast1″,
++SUM(“reVehicles”) OVER “last2″ AS “reVehiclesLast2″,
++SUM(“reVehicles”) OVER “last3″ AS “reVehiclesLast3″,
++SUM(“reVehicles”) OVER “last4″ AS “reVehiclesLast4″,
++SUM(“reVehicles”) OVER “last5″ AS “reVehiclesLast5″,
++SUM(“reVehicles”) OVER “last15″ AS “reVehiclesLast15″,
++SUM(“reSpeed” * “reConfidence”) OVER “last1″ /
++SUM(“reConfidence”) OVER “last1″ AS “reSpeedLast1″,
++SUM(“reSpeed” * “reConfidence”) OVER “last2″ /
++SUM(“reConfidence”) OVER “last2″ AS “reSpeedLast2″,
++SUM(“reSpeed” * “reConfidence”) OVER “last3″ /
++SUM(“reConfidence”) OVER “last3″ AS “reSpeedLast3″,
++SUM(“reSpeed” * “reConfidence”) OVER “last4″ /
++SUM(“reConfidence”) OVER “last4″ AS “reSpeedLast4″,
++SUM(“reSpeed” * “reConfidence”) OVER “last5″ /
++SUM(“reConfidence”) OVER “last5″ AS “reSpeedLast5″,
++SUM(“reSpeed” * “reConfidence”) OVER “last15″ /
++SUM(“reConfidence”) OVER “last15″ AS “reSpeedLast15″
FROM “Stage3″
WINDOW “last1″ AS (PARTITION BY “RE”
++RANGE INTERVAL ‘1′ MINUTE PRECEDING),
+++++“last2″ AS (PARTITION BY “RE”
++RANGE INTERVAL ‘2′ MINUTE PRECEDING),
+++++“last3″ AS (PARTITION BY “RE”
++RANGE INTERVAL ‘3′ MINUTE PRECEDING),
+++++“last4″ AS (PARTITION BY “RE”
++RANGE INTERVAL ‘4′ MINUTE PRECEDING),
+++++“last5″ AS (PARTITION BY “RE”
++RANGE INTERVAL ‘5′ MINUTE PRECEDING),
+++++“last15″ AS (PARTITION BY “RE”
++RANGE INTERVAL ‘15′ MINUTE PRECEDING);

Detecting the rapid onset of congestion
Congestion is detected by comparing moving averages for the larger time window with that for the smaller time window. For example, comparing a 2-minute average with a 1-minute average:

CREATE OR REPLACE VIEW “CongestionRule1″ AS
SELECT STREAM
++–- name, ID, highway name, speed limit etc. for each road element
++“RE”, “reID”, “Carriageway”, “rePrescribed”, “reSpeedLimit”,
++–- volume of vehicle reports in each time window
++“reVehiclesLast1″, “reVehiclesLast2″, “reVehiclesLast3″,
++“reVehiclesLast4″, “reVehiclesLast5″, “reVehiclesLast15″,
++–- estimated avg speed for each road element
++“reSpeedLast1″, “reSpeedLast2″, “reSpeedLast3″,
++“reSpeedLast4″, “reSpeedLast5″,”reSpeedLast15″
FROM “EstimatedReSpeeds”
WHERE “reSpeedLast1″ < 0.80 * “reSpeedLast2″ AND – slowdown by 20 %
++“reSpeedLast2″ < 0.80 * “reSpeedLast3″ AND
++“reSpeedLast3″ < 0.80 * “reSpeedLast4″ AND
++“reSpeedLast4″ < 0.80 * “reSpeedLast5″ ;

SQLstream Traffic Congestion Detection - Visualization

SQLstream Traffic Congestion Detection - Visualization (click to enlarge)

Note that these estimated speeds are over overlapping windows and as such slowdown thresholds are set accordingly.

Fine tuning slowdown thresholds and other information, such as the proximity of traffic lights and the volume of vehicle reports in each time window, improves the quality of congestion detection algorithm.

The Google Earth screenshot illustrates real-time traffic view as well as detected slowdowns as pins. The severity of the slowdown is indicated by different shades of red.

Real-Time Seismic Monitoring in the Cloud with SQLstream

Wednesday, July 20th, 2011

SQLstream is helping to predict earthquakes across the world in real time. The system has been developed by a consortium of universities and government agencies, with funding from NSF (National Science Foundation), to provide an infrastructure of networked tools for research in ocean science – constructing an internet-based system to collect and share data.

Ocean Research Program Overview

Real-time Seismic Monitoring Program

This is a large system with 16,000 land and sea-based sensors, each sending several channels of seismic event data at a rate of 40 data points per second. The application executes in an Amazon EC2 Cloud on a cluster of SQLstream servers, connected by an AMQP message bus. SQLstream’s AMQP adapter (built with RabbitMQ) enables the streaming SQL application to view the AMQP bus as a domain of input and output data streams. The initial prototype was upgraded to a full-scale system running on a cluster of servers without any changes to the streaming SQL application.

SQLstream’s contribution to the project, an application that processes seismographic data in real-time, demonstrates:

  1. An operational deployment of streaming SQL for scientific calculations in real-time
  2. Real-time distributed processing in an Amazon EC2 cloud using SQLstream and AMQP
  3. Rapid development and rollout of real-time data applications using standards-based streaming SQL.

Seismic Monitoring
The sensor network contains about 16,000 seismic sensors, organized into grids, covering large parts of the North American continent and the adjacent oceans (see illustration below). Each sensor measures the motion of the ground under it in three dimensions, and transmits its data as several digitized channels, typically sampled 40 times a second.

Seismic Sensor Map

Seismic Sensor Map

While the rate of each signal channel is modest (since seismic waves are low-frequency),
this adds up to a large amount of data to process in real time. Moreover, the rules for detecting a seismic event are heuristics that apply to a time interval of several minutes: so the application has to calculate some quantities from the raw data and to store these calculated values over a time window.

But to detect an earthquake reliably, it’s better to monitor all the sensors at once, looking for a disturbance in the signals that first appear in one place, and then appear in nearby places: a disturbance signal that propagates and changes shape in a way consistent with the physics of a seismic wave.

Real-time sensor network management in an EC2 Cloud
Monitoring tens of thousands of signal channels arriving at 40 sample points per second is a complicated problem, but it can be made simpler by breaking it into stages. We’re interested in earthquakes, which are infrequent, so SQLstream first reduces the amount of data by scanning each channel for patterns that suggest the beginning, the peak or the end of a quake: in other words reduce the dense signal to a sequence of interesting events. Then we can look for events detected on other channels that could be due to the same quake propagating in physical space and time.

In the first phase, we built a real-time seismic event detector in streaming SQL.
We translated a scientific algorithm into streaming SQL, and connected to the scientific sensor data infrastructure – using our AMQP adapter. This involved less than 100 lines of streaming SQL.

In the second phase, the prototype was scaled up to a full-size system, dealing with 16,000 sensor channels, running on multiple SQLstream server nodes created automatically in an Amazon Elastic Cloud. This expansion required no changes to the streaming SQL application developed for the initial prototype – simply running the same streaming SQL application inside an elastic container/manager.

Real-time Event Detection with Streaming SQL
The sensor data processing pipeline has five functional stages:

  1. Reading Messages – over the AMQP adapter. A configuration parameter specifies which “topics” SQLstream subscribes to, that is, which set of sensor channels it receives.
  2. Unpack Data – a user defined transform (UDX) unpacks data channel messages into individual data points.
  3. Signal Processing – extract higher order information from the raw data to identify seismic events given background noise and non-seismic ‘bumps’. An example function would be to calculate the ratios of multiple rolling averages of the signal value (x) over different time windows.

    Plot of Seismic Events

  4. Event Detection – The next stage applies heuristic rules to the processed data streams – the output is much sparser: a stream of significant events, each indicating a possible start/peak/stop of a seismic wave on a particular channel. If events of the correct type occur within a correct interval of each other (as shown in the Signal Plot Diagram), they are accepted as significant.
  5. Writing Messages – output (publish) significant events over the AMQP adapter.

Automatic EC2 scaling for Big Data sensor volume
The pipeline described has been proven to scale well to handle even greater data volume:

  1. Channels are processed independently, so scale as O(N).
  2. Additional channels are processed by adding more pipelines
  3. Each pipeline starts and ends with AMQP messages – the SQLstream / AMQP interoperability has been shown to scale well.
  4. To add a pipeline, we simply add another Elastic Compute server, running the same pipeline streaming SQL, but configured to subscribe to its own set of sensor channels.

This is the first part of a series of blogs describing the seimic monitoring solution. The next blogs will focus on the streaming SQL used in the application, and the SQLstream / AMQP architecture for Big Data scalability.

SQLstream and Mozilla Firefox 4, a look at the SQL behind the scene

Wednesday, March 23rd, 2011

Since Tuesday’s announcement that the Firefox Download Monitor is powered by SQLstream, we’ve received a number of questions about how it all fits together. I hope this description helps answer some of those questions.

Mozilla Firefox Real-Time Download Monitor - Day 2

Mozilla's Firefox Real-Time Download Monitor - Day 2

SQLstream server executes SQL statements, just like standard SQL, except the SQLstream’s queries run continuously, analyzing input data in real-time as it arrives. Statements are presented via JDBC or user friendly tools which use JDBC internally. Statements are compiled/prepared, the planner/optimizer chooses an access plan, and a runtime engine executes the plan. SQLstream is compliant with SQL 2008 and 2003 with just a couple of extensions. One extension includes the keyword STREAM as part of a SELECT statement. The STREAM keyword indicates that the results are continuously streaming rather than a point in time TABLE.

Applications in SQLstream are constructed out of a set of SQL CREATE STREAM statements and SQL VIEWS against streams and other views. Those statements are assembled into a pipeline. When describing a pipeline we refer to statements on the source side as being upstream, and statements closer to the destination as being downstream.

In the middle of the pipeline, we define a stream named FirefoxDownloadStream_ which contains the results of the parsed and conditioned download events. The stream declaration is identical to a table definition with the exception of the type of the object being a STREAM rather than TABLE.

CREATE STREAM "FirefoxDownloadStream_" (
+++"download_type"++++++++++VARCHAR(15),
+++"utc_timestamp"++++++++++TIMESTAMP,
+++"product_name"+++++++++++VARCHAR(12),
+++"product_version"++++++++VARCHAR(12),
+++"product_major_version"++VARCHAR(12),
+++"product_os"+++++++++++++VARCHAR(10),
+++"locale_code"++++++++++++VARCHAR(5),
+++"country_code"+++++++++++VARCHAR(2),
+++"city_name"++++++++++++++VARCHAR(32),
+++"region_code"++++++++++++VARCHAR(2),
+++"longitude"++++++++++++++VARCHAR(8),
+++"latitude"+++++++++++++++VARCHAR(8)
);

The stream is populated with a SQL INSERT-SELECT statement. Again, standard SQL statements are used. The WHERE clause defines that the downloads include new first time downloads, complete upgrades of prior versions of Firefox, or partial upgrades of prior versions of Firefox.

INSERT INTO "FirefoxDownloadStream_"
++("download_type",
+++"utc_timestamp",
+++"product_name",
+++"product_version",
+++"product_major_version",
+++"product_os",
+++"locale_code",
+++"country_code",
+++"city_name",
+++"region_code",
+++"longitude",
+++"latitude"
++)
SELECT STREAM
+++"dlType"+++AS "download_type",
+++"dlTime"+++AS "utc_timestamp",
+++"product"++AS "product_name",
+++"version"++AS "product_version",
+++"GetMajorVersion"("version") AS "product_major_version",
+++"os"+++++++AS "product_os",
+++"lang",++++AS "locale_code",
+++"cc",++++++AS "country_code",
+++"city",++++AS "city_name",
+++"rg",++++++AS "region_code",
+++CAST("latitude" AS VARCHAR(10)) AS "latitude",
+++CAST("longitude" AS VARCHAR(10)) AS "longitude"
FROM "FirefoxCountryFilter"
WHERE (("dlType" IS NULL) OR ("dlType" = 'complete') OR ("dlType" = 'partial'));

The download events contain the time of each download. Mozilla has a number of download servers feeding the worldwide requests to download Firefox. Each of these servers feeds the results of the download requests to a common logfile which is “tailed” by SQLstream. As the time for each download differs due to each client’s network capacity, the download requests may be slightly out of order. In practice the biggest gap we’ve seen is 4 seconds. Since we’re measuring downloads over the long period of time, it was deemed sufficient to adjust the download time of late arrivals to match the most recent download time.
The following SQL statement does that adjustment.


CREATE OR REPLACE VIEW "FirefoxDownloadStream" AS
+++SELECT STREAM MAX("utc_timestamp") OVER(ROWS UNBOUNDED PRECEDING)
++++++++++AS ROWTIME,
++++++++++*
+++FROM "FirefoxDownloadStream_";

SQLstream associates a ROWTIME with each row in a STREAM. The ROWTIME is a monotonically increasing SQL timestamp. In the default case, the ROWTIME is the current time expressed in UTC. Most applications require time to be defined according to time associated with the data itself. Associating the ROWTIME of a row in a stream based on the data contents of the row, is done by the AS ROWTIME clause for an individual column. In the Mozilla pipeline, we set the ROWTIME to be the maximum of the values in the “utc_timestamp” column to be that rows ROWTIME.
The analytics portion of the pipeline is implemented with a standard SQL statement. For example, each 10 seconds the number of downloads for each product, version, … country, city, region is calculated.


CREATE OR REPLACE VIEW "FirefoxStreamForLocationCounters"
DESCRIPTION 'Compute product counters for a minute' AS
+++SELECT STREAM
++++++++++"download_type",
++++++++++"product_name",
++++++++++"product_major_version",
++++++++++"product_version",
++++++++++"country_code",
++++++++++"region_code",
++++++++++"city_name",
++++++++++"latitude",
++++++++++"longitude",
++++++++++count(*) AS "count"
+++FROM "FirefoxDownloadStream" F
+++GROUP BY FLOOR(F.ROWTIME TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '10' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '20' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '30' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '40' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '50' SECOND TO MINUTE),
++++++++++++"product_name",
++++++++++++"download_type",
++++++++++++"product_major_version",
++++++++++++"product_version",
++++++++++++"country_code",
++++++++++++"region_code",
++++++++++++"city_name",
++++++++++++"latitude",
++++++++++++"longitude";

There is a similar view declaration where similar calculations are done for each product. Most of the interest since Tuesday is of course related to Firefox 4.0 downloads. This second view allows Mozilla to drill down on downloads by platform as well as downloads for previous (and future) Firefox versions.


CREATE OR REPLACE VIEW "FirefoxStreamForProductCounters"
DESCRIPTION 'Compute product counters for a minute' AS
+++SELECT STREAM
++++++++++"download_type",
++++++++++"product_name",
++++++++++"product_major_version",
++++++++++"product_version",
++++++++++"product_os",
++++++++++count(*) AS "count"
+++FROM "FirefoxDownloadStream" F
+++GROUP BY FLOOR(F.ROWTIME TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '10' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '20' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '30' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '40' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '50' SECOND TO MINUTE),
++++++++++++"product_name",
++++++++++++"download_type",
++++++++++++"product_major_version",
++++++++++++"product_version",
++++++++++++"product_os";

Each of these views (FirefoxStreamForLocationCounters and FirefoxStreamForProductCounters) is based on the FirefoxDownloadStream. Each defined stream and view is a point where the application can access data either directly or via another VIEW or INSERT…SELECT.

One component of the solution is a piece of code we call the HBaseAgent. The agent uses the JDBC interface to SQLstream and issues a SELECT * FROM each of the described views containing the location and product counter 10-second download counts. The HBaseAgent maps each fetched row to the HBase schema as defined by Mozilla.

I write this blog about 24 hours after Firefox 4 launched. So far there are more than 8 million downloads of Firefox 4. It certainly has been an exciting day for Mozilla and I congratulate everyone who contributed. I’m happy that SQLstream has been able to contribute to their success.

SQLstream and Mozilla Firefox 4.0

Tuesday, March 22nd, 2011

SQLstream has been powering Mozilla’s Firefox Download Monitor since 2009. A SQLstream based application has been continually aggregating hundreds of millions of download events, receiving minute by minute aggregations via a continuously running SQL SELECT statement using the SQLstream JDBC driver. A continuously running SELECT statement is syntactically and semantically identical to other SELECT statements with the addition that end of data is never returned in SQLSTATE by the FETCH associated with the cursor.

Mozilla 4.0 Real-Time Download Monitor is Powered By SQLstream

Mozilla 4.0 Real-Time Download Monitor is Powered By SQLstream

For the launch of Firefox 4, Mozilla again turned to SQLstream to enhance the download monitor. Applications in SQLstream are built by defining a series of SQL stream definitions and SQL views. Business rules are embedded in these definitions. Definitions are assembled into a pipeline and each definition provides a point where data is available to applications. A stream definition is analogous to a SQL table definition and contains the column names and data types for each defined element. Each stream has an implicit ROWTIME column, a monotonically increasing value associated with the data in each column.

The download monitor tails the log files written by all of Mozilla’s download servers which provide new versions of Firefox. Each entry in the log is parsed. The IP address of each download is converted to a country, city, region, latitude, and longitude. For the Firefox 4 release, Mozilla wanted the results of the download aggregation to be stored in an HBase table in their HBase/Hadoop cluster. Storing the data allows future historical analysis to complement the realtime analysis provided by SQLstream.

SQLstream aggregates downloads to two separate column families in a single HBase table. The ‘product’ column family contains the overall download count. The ‘location’ column family contains the count of downloads for each country, region, city, latitude, longitude.

SQLstream uses a GROUP BY clause along with a COUNT(*) to calculate the number of downloads for each 10 seconds. The author wrote a new piece of SQLstream code which provides an interface from SQLstream to HBase. The HBaseAgent maps the results of the GROUP BY and calls the HBase API to persist data in HBase. The incrementColumnValue API is key in that it allows SQLstream to aggregate download counts on a realtime basis and efficiently update HBase by providing incremental values.

The application periodically reads data from HBase and sends formatted data to each connected browser. See for yourself at http://glow.mozilla.org/. The map provides a running count of Mozilla Firefox 4.0 download with raindrops on the map indicating each location where one or more downloads have just occurred. As I write this blog entry, I can see that Europe and Asia are hot while North America is just waking up. Clicking on the colored rings in the lower left hand corner of the map, allows drill down to the geographic locations.

Mozilla’s Daniel Einspanjer has also blogged about their new real-time download vizualization application. The blog explains the overall architecture of the real-time application using SQLstream, and the SQLstream integration with HBase.

You can read more about the previous Firefox 3 download monitor on Julian Hyde’s blog. Julian is the CTO of SQLstream.

Mozilla also blogged about the history of the Firefox 3 download monitor on the Mozilla Webdev blog.

SQLstream at PostgreSQL Conference: West 2010 (2)

Monday, November 8th, 2010

PostgreSQL Conference: West 2010Attendance at the PostgreSQL West 2010 Conference was encouraging considering a million people had gathered in the city to celebrate the World Series victory of the San Francisco Giants.

I’ve posted the presentation from the event (previously blogged about here) as given by myself and SQLstream’s chief architect Julian Hyde.

We presented the concepts of Streaming GIS, integrating SQLsteam’s real-time streaming data analytics with the PostgreSQL-based Geographic Information Systems (GIS) engine PostGIS. With examples from SQLstream’s commercial traffic congestion monitoring application, we discussed how sophisticated high performance real-time geospatial applications can be delivered quickly and easily using standards-based SQL.

Click here for a non-slideshare PDF version.

SQLstream at PostgreSQL Conference: West 2010

Tuesday, November 2nd, 2010

PostgreSQL Conference: West 2010SQLstream’s chief architect Julian Hyde and founding engineer Sunil Mujumdar are to present at PostgreSQL Conference: West 2010.

In a talk entitled Streaming GIS using PostGIS and SQLstream, Julian and Sunil will describe the SQLstream stream computing platform based on industry-standard SQL, and its integration with the PostgreSQL-based Geographic Information Systems (GIS) engine PostGIS.

Wireless sensors and internet services are generating data faster than conventional database technologies can process that data. In particular, mobile resource management requires the stream processing of high volume, location-based data. Solutions require new methods such as Streaming SQL to address these new sources of big data, in real-time. We’ll be discussing key concepts in stream computing, data warehouse feeds and the integration of PostGIS into a high performance streaming environment.

Using mobile resource management as a case study, we will illustrate with examples from SQLstream’s commercial traffic monitoring application.

Can GPS solutions really monitor parolees in real-time?

Tuesday, August 10th, 2010

A recent San Francisco Chronicle article described using GPS anklets to track former gang members on parole, expanding a program first used on paroled sex offenders.

The concept is great: if you know where parolees are, you can make sure they don’t violate parole restrictions (or catch them if they do). But the technology doesn’t always achieve the goal, for the simple reason that someone has to be tracking the anklets.

Technology can also help solve the problem. Standard data warehousing practices can tell authorities if someone violated parole yesterday, but having the information at the moment parole violations are occurring enables reacting quickly and decisively. In some cases, GPS time-&-location data can prevent crimes as well as aid in solving them after the fact.

Real time data analytics can do this by monitoring the information in real time and sending a page or other alert to parole officers as soon as a violation occurs.  At SQLstream, we’ve been working with customers to monitor real-time data from such sensors, providing instant real-time reports and alerts against pre-determined boundaries of time or space.

It’s also a growing market across the globe. In one state in Germany, ankle bracelets for monitoring time or location boundaries assigned to offenders on probation appear effective, helping probationers stick to a regular schedule, among other benefits.

The business case for better technology is driven by the potential for huge cost savings.  GPS anklet solutions cost only one-third what incarceration costs: about 33 euros per day versus about 100 euros for a day in jail (about $44 and $133, respectively). In California, using numbers from the Chronicle article and the California Department of Corrections and Rehabilitation, GPS ankle monitoring costs about $26 per day, while jail time averages about $133.

Such devices are also in use in other American communities, and German State justice ministers were scheduled to meet June 30th to discuss implementing the bracelets in other parts of Germany.

So it looks like GPS anklets for dangerous parolees may be here to stay, and with a step change in the supporting monitoring technology, true real-time analysis and reporting of exceptions and corrective action can be a reality.

Streaming Sensor Data

Thursday, July 29th, 2010

Railroads have used track side readers to scan bar codes on the sides of freight cars since the 1970s. Such sensors provided real time tracking of goods as they made their way from the supplier to the delivery point. Retail businesses increased the use of RFID tags in the past 20 years to track goods through the manufacturing process. Since the Indian Ocean tsunami of December 2004 the public has become aware of deep water pressure sensors which sit on the ocean floor to detect tsunamis and are intended to generate warnings about potential disasters.

The cost of sensors has decreased significantly in recent years and as a result inexpensive sensors are present nearly everywhere in businesses. As the price of sensors decreases it becomes economically feasible to deploy thousands and even millions of sensors. Such sensors cumulatively generated huge volumes of data. Imagine placing a sensor capable of measuring temperature, humidity, sun light and air pressure sensor within each square kilometer in the state of Iowa to assist farmers in managing crop production. Now imagine each of those 145,743 sensors generating 100 bytes of data every minute resulting in a data volume of nearly 21GB per day.

There is much buzz about Big Data and the challenges of applying traditional database management tools to extract business value from such data. Fortunately, there is a better way – integrating real time data, as provided by sensors, with stream analytic processing, allows timely enterprise decisions in response to changing conditions.

I urge you to read Damian Black’s recent postings on this blog describing the SQLstream approach to “Big Data”.

(more…)

Intelligent Transportation and the ITSA Conference

Wednesday, June 2nd, 2010

Just back from the 2010 Intelligent Transportation Society of America’s Annual Meeting.  For those unfamiliar with intelligent transportation, I am not referring to the “shovel ready” projects that have been funded by President Obama as part of the economic stimulus package. These projects were designed to spend money and create jobs, thereby, stimulating the economy. Unlike the federal “shovel ready” projects, “network ready” intelligent transportation technologies and projects are rapidly being adopted and implemented by local and state departments of transportation that must still operate under fixed or reduced budgets. These local and state DOTs are using new technologies to “Do More with Less.”

ITSAIntelligent Transportation aims to reduce costs, delays, pollution, injuries and deaths by connecting infrastructure control and monitoring systems to the network and enabling these systems, and their operators, to communicate in real-time. Some examples of intelligent transportation solutions and control systems include dynamic speed limits that change according to traffic and road conditions, stop lights that know when you can go and the FasTrak electronic toll system that reduces congestion on the Golden Gate Bridge and other Bay Area bridges. Real-time technology is essential if these dynamic control systems are to collect your toll at 45 miles per hour or detect when it is safe to proceed through an intersection.

All of these intelligent transportation systems and devices can be thought of as “sensors” on the network. The data is collected by the sensors, streamed to a server, analyzed and eventually stored in a warehouse. (Imagine the final scene from Raiders of The Lost Ark, except with crates full of hard drives). Meanwhile, the analytic results are communicated back to the original sources (stop lights, toll booths and electronic road information signs) as well as to the mobile devices in your vehicle.

In some cases, new intelligent transportation solutions need to be integrated with legacy systems. In other cases, they simply need to be able to talk to each other. Thus, it becomes imperative that all new intelligent transportation solutions be built on a set of common, open standards. In the long run, solutions built on open standards reduce the total costs to those who implement and maintain the solutions. Open standards, and in particular, the global use of open data standards, within the intelligent transportation industry is essential, not just so that different sensors on the network and IT solutions can communicate with each other, but so that drivers can experience consistent and safe journeys as they cross from federal highways to state and local roads, always in contact with intelligent transportation systems that control these roads.