Posts Tagged ‘open standards’

Streaming SQL and Bollinger Bands

Thursday, June 10th, 2010

Last year has been an interesting experience as I participated in a number of customer “Proof Of Concept” projects for SQLstream. Developing these real-time, stream computing projects greatly increased my appreciation for the advantages of an open, extensible and standards-compliant middleware infrastructure.

For example, I needed to implement an “edge detection” mechanism for a POC project. My colleagues at SQLstream recommended using “Bollinger bands” for determining outliers. So, I browsed through the  wikipedia entry for Bollinger Bands to learn more. Bollinger bands are very similar to standard deviations or quartile deviations. A Standard deviation measures variability or dispersion in data distribution. Bollinger bands, on the other hand, provide thresholds to filter outliers in the data. In fact, Bollinger bands are based on the moving average and moving standard deviation of the data set. For typical data sets, Bollinger bands can be defined as:

lowerBB(lower Bollinger Band) = avg – (k * stddev),

upperBB(upper Bollinger Band) = avg + (k* stddev)

where avg and stddev are the average and standard deviation over a sufficiently large time window and k is the constant that needs to be determined for the activity being monitored. For typical data sets, k = 2 will create the upper bollinger band at 95th percentile of the data set.

Bollinger Bands are widely used in the financial services industry. However, Bollinger Bands can be applied to solve problems in other industries. (As I am not claiming to be a statistics expert, I would certainly appreciate honest feedback on our application of Bollinger bands in streaming queries.)

Bollinger bands certainly are a good tool to identify sudden spikes in the activity being monitored in real-time. A number of examples come to my mind,

  • Sudden spikes in the price for a ticker symbol in a stock exchange. For example,

SELECT STREAM ROWTIME, ticker, price,

FROM (SELECT STREAM ROWTIME, ticker, price,

AVG(price) OVER (PARTITION BY ticker RANGE INTERVAL ‘1′ HOUR PRECEDING) AS “avgLastHour”,

STDDEV(price) OVER (PARTITION BY ticker RANGE INTERVAL ‘1′ HOUR PRECEDING) AS “stdDevLastHour”,

AVG(price) OVER (PARTITION BY ticker ROWS 5 PRECEDING) AS “avgLast5Trades”

FROM BIDS) AS S

WHERE S.”avgLast5Trades” > S.”avgLastHour” + 2 * S.”stdDevLastHour”;

  • Spikes in the error rate on a web server. For example,

SELECT STREAM ROWTIME, url, “numErrorsLastMinute”,

FROM (SELECT STREAM ROWTIME, url, “numErrorsLastMinute”,

AVG(“numErrorsLastMinute”) OVER (PARTITION BY url RANGE INTERVAL ‘1′ HOUR PRECEDING) AS “avgErrorsPerMinute”,

STDDEV(“numErrorsLastMinute”) OVER (PARTITION BY url RANGE INTERVAL ‘1′ HOUR PRECEDING) AS “stdDevErrorsPerMinute”

FROM “HttpRequestsPerMinute”) AS S

WHERE S.”numErrorsLastMinute” > S.”avgErrorsPerMinute” + 2 * S.”stdDevErrorsPerMinute”;

  • Monitoring call volumes in a call center.
  • Analytics on social/online gaming services.

In the Stream Computing context, Bollinger bands provide the high/low-water marks for monitoring activity. Whenever the level of recent activity crosses these Bollinger Band thresholds, the activity can be flagged. The streaming analytics engine can then perform additional analytics to detect patterns in the activity and to provide actionable information to regulate the system that is being monitored. At the very least, Bollinger bands can be used to filter out “uninteresting” rows from the stream, thereby reducing the load on the streaming pipeline.

At SQLstream, we used windowed aggregation functions such as AVG() OVER (…) and STDDEV() OVER (…) to establish Bollinger bands. It is necessary to compute AVG and STDDEV on sufficiently large windows of time. In a streaming context, we used sufficiently large windows of time to calculate Bollinger bands. So, as the window slides forward in time, the Bollinger bands reflect more recent activity levels. The current activity levels can then be computed on a much smaller window, potentially including only the current row in the stream. Should the current activity level cross either of the Bollinger bands, we then mark that as a spike in the activity level. The formula for Bollinger bands needs to be changed based on the data distribution, that is, to determine exactly what multiple of standard deviation is appropriate.

Coming back to my point about openness and extensibility, as you can see in the example queries above, you could execute a very similar query in Oracle or SQL server. Key features such as windowed aggregation functions, often called SQL OLAP functions, have been in SQLstream for a long time. Interestingly, SQLstream did not support STDDEV() windowed aggregation function during the POC. A lot of the SQL experts will know STDDEV can be easily rewritten using a formula involving AVG. Our Chief Architect, Julian Hyde, was quick enough to “sweeten” the deal by adding the “syntactic sugar” necessary to support STDDEV natively.

I am sure a lot of you readers have interesting ideas and questions. Please feel free to post them here and I will be happy to engage in conversation.

Intelligent Transportation and the ITSA Conference

Wednesday, June 2nd, 2010

Just back from the 2010 Intelligent Transportation Society of America’s Annual Meeting.  For those unfamiliar with intelligent transportation, I am not referring to the “shovel ready” projects that have been funded by President Obama as part of the economic stimulus package. These projects were designed to spend money and create jobs, thereby, stimulating the economy. Unlike the federal “shovel ready” projects, “network ready” intelligent transportation technologies and projects are rapidly being adopted and implemented by local and state departments of transportation that must still operate under fixed or reduced budgets. These local and state DOTs are using new technologies to “Do More with Less.”

ITSAIntelligent Transportation aims to reduce costs, delays, pollution, injuries and deaths by connecting infrastructure control and monitoring systems to the network and enabling these systems, and their operators, to communicate in real-time. Some examples of intelligent transportation solutions and control systems include dynamic speed limits that change according to traffic and road conditions, stop lights that know when you can go and the FasTrak electronic toll system that reduces congestion on the Golden Gate Bridge and other Bay Area bridges. Real-time technology is essential if these dynamic control systems are to collect your toll at 45 miles per hour or detect when it is safe to proceed through an intersection.

All of these intelligent transportation systems and devices can be thought of as “sensors” on the network. The data is collected by the sensors, streamed to a server, analyzed and eventually stored in a warehouse. (Imagine the final scene from Raiders of The Lost Ark, except with crates full of hard drives). Meanwhile, the analytic results are communicated back to the original sources (stop lights, toll booths and electronic road information signs) as well as to the mobile devices in your vehicle.

In some cases, new intelligent transportation solutions need to be integrated with legacy systems. In other cases, they simply need to be able to talk to each other. Thus, it becomes imperative that all new intelligent transportation solutions be built on a set of common, open standards. In the long run, solutions built on open standards reduce the total costs to those who implement and maintain the solutions. Open standards, and in particular, the global use of open data standards, within the intelligent transportation industry is essential, not just so that different sensors on the network and IT solutions can communicate with each other, but so that drivers can experience consistent and safe journeys as they cross from federal highways to state and local roads, always in contact with intelligent transportation systems that control these roads.