Posts Tagged ‘windowed aggregation’

How SQLstream Can Reduce Complex Business Logic to a Single SQL Query

Monday, October 18th, 2010

In the game industry, complex game logic needs to be applied to streams of events generated by gameplay.  In single player games, this logic is simply handled by applying the correct computations.  However, in an Internet based social game where millions of players interact together online, the problem takes on an entirely different dimension.  Storing the game events on disk inside of a database becomes increasingly difficult as the rate of gameplay events increases.  Logic and computation must be applied to the data and a disparate set of data must be queried to correctly update the game state.

The solution can be surprisingly simple and similar in form to storing all gameplay events in a traditional database.  SQLstream’s powerful streaming and windowed aggregation capabilities can reduce this use case and complex logic to a single query.

For example, consider a game where users create videos that are viewed and rated by other users:

  • A score for the video must be computed based upon the last 4 weeks of gameplay.
  • Let’s say a view is worth 12 points and various ratings levels are worth between 0 and 30 points.
  • The content’s score is the total points accumulated in the last week, plus 50% of the points in the week before that, plus 20% the points in the week before that and 10% of the points in the week before that.

Let’s assume we have two streams of data which contain streams of gameplay events:

--
-- A stream containing 1 row per an individual view of a video
--
CREATE STREAM "s_video_view" (
++++++++++++++++"video_id" INTEGER,+++++-- the viewed video
++++++++++++++++"user_id" INTEGER,++++++-- the id of the viewer
++++++++++++++++"performer_id" INTEGER++-- the id of the performer
++++++++++++++++);


--
-- A stream containing 1 row per an individual rating of a video
--
CREATE STREAM "s_video_rating" (
++++++++++++++++"video_id" INTEGER,+++++-- the viewed video
++++++++++++++++"performer_id" INTEGER,+-- the id of performer
++++++++++++++++"rater_id" INTEGER,+++++-- the id of the rater
++++++++++++++++"rating" INTEGER++++++++-– the rating given
++++++++++++++++);

To handle all the described business logic, we must first compute the total number of points generated from each gameplay event and add that to the stream of tuples entering the system.  The following query accomplishes this:

SELECT STREAM "performer_id", "video_id", 12 AS "points"
+++++FROM "s_video_view"
UNION ALL
SELECT STREAM "performer_id", "video_id",
++++++++++CASE "s_video_rating"."rating" WHEN 2 THEN 1
+++++++++++++++++++++++++++++++++++++++++WHEN 3 THEN 8
+++++++++++++++++++++++++++++++++++++++++WHEN 4 THEN 20
+++++++++++++++++++++++++++++++++++++++++WHEN 5 THEN 30
+++++++++++++++++++++++++++++++++++++++++ELSE 0
+++++++++++++++++++++++++++++++++++++++++END AS "points"
+++++FROM "s_video_rating"

This query will produce a stream of data containing the performer_id, video_id, and points for each scoring event in the system.

Next we must compute rolling time based score over the event stream for each unique video_id.  SQLstream provides the SQL:99, SQL:2003, and SQL:2008 standard WINDOW facility to make this easy:

WINDOW "last_7_days" AS ( PARTITION BY "video_id"
++++++++++++++++++++++++++RANGE INTERVAL '7' DAY PRECEEDING),
+++++++"last_14_days" AS ( PARTITION BY "video_id"
+++++++++++++++++++++++++++RANGE INTERVAL '14' DAY PRECEEDING),
+++++++"last_21_days" AS ( PARTITION BY "video_id"
+++++++++++++++++++++++++++RANGE INTERVAL '21' DAY PRECEEDING),
+++++++"last_28_days" AS ( PARTITION BY "video_id"
+++++++++++++++++++++++++++RANGE INTERVAL '28' DAY PRECEEDING)

These rolling windows contain all the scoring events over the last 7, 14, 21, and 28 days respectively grouped by the video_id.  This means that any aggregation function applied to that window will be applied to the stream events with the same video_id.  So, COUNT(*) OVER "last_7_days" would produce one row for each unique video_id with a scoring event in the last week.  Those rows would contain a count of the number of scoring events for each unique video_id.

By subtracting the SUM of the points in the 7 day window from the number of points in the 14 day window, we can compute the number of points in the week starting two weeks ago and ending one week ago.  This technique allows us to implement computations on rolling windows that are not bounded by the current time.

Putting the entire example together, we get the following view:

CREATE VIEW "v_video_score" AS
+++++SELECT STREAM "video_id", "performer_id",
+++++++++++++++SUM("points") OVER "last_7_days" +
+++++++++++++++++((SUM("points") OVER "last_14_days" -
+++++++++++++++++++SUM("points") OVER "last_7_days") * 0.5) +
+++++++++++++++++((SUM("points") OVER "last_21_days" -
+++++++++++++++++++SUM("points") OVER "last_14_days") * 0.2) +
+++++++++++++++++((SUM("points") OVER "last_28_days" -
+++++++++++++++++++SUM("points") OVER "last_21_days") * 0.1) AS "score"
++++++++FROM ( SELECT STREAM "performer_id", "video_id", 12 AS "points"
++++++++++++++++++++FROM "s_video_view"
+++++++++++++++UNION ALL
+++++++++++++++SELECT STREAM "performer_id", "video_id",
+++++++++++++++++++++++++++CASE "s_video_rating"."rating" WHEN 2 THEN 1
+++++++++++++++++++++++++++++++++++++++++++++++++WHEN 3 THEN 8
+++++++++++++++++++++++++++++++++++++++++++++++++WHEN 4 THEN 20
+++++++++++++++++++++++++++++++++++++++++++++++++WHEN 5 THEN 30
+++++++++++++++++++++++++++++++++++++++++++++++++ELSE 0
+++++++++++++++++++++++++++++++++++++++++++++++++END AS "points"
++++++++++++++++++++FROM "s_video_rating" )
++++++++WINDOW "last_7_days" AS ( PARTITION BY "video_id"
++++++++++++++++++++++++++++++++++RANGE INTERVAL '7' DAY PRECEEDING),
+++++++++++++++"last_14_days" AS ( PARTITION BY "video_id"
+++++++++++++++++++++++++++++++++++RANGE INTERVAL '14' DAY PRECEEDING),
+++++++++++++++"last_21_days" AS ( PARTITION BY "video_id"
+++++++++++++++++++++++++++++++++++RANGE INTERVAL '21' DAY PRECEEDING),
+++++++++++++++"last_28_days" AS ( PARTITION BY "video_id"
+++++++++++++++++++++++++++++++++++RANGE INTERVAL '28' DAY PRECEEDING);

which produces a stream of rows containing the video_id, performer_id, and a score computed over the rolling windows.  This view outputs a row every time a new event is inserted added the system.  The complex business logic has been reduced to one SQL query.

By attaching the results of this query to a foreign table, a nearly real-time cache of the videos and their current scores can be maintained in your database.  Alternatively, by using SQLstream’s JDBC driver, a distributed caching system such as memcached could be kept updated with the latest scores for each video simply by calling this query.  But those are topics for another post.

I hope you enjoyed this simplified real world example of how streaming SQL can:

  • simplify your business logic processing
  • improve your ability to deliver real-time data to your customers, clients, and colleagues.

Streaming SQL and Bollinger Bands

Thursday, June 10th, 2010

Last year has been an interesting experience as I participated in a number of customer “Proof Of Concept” projects for SQLstream. Developing these real-time, stream computing projects greatly increased my appreciation for the advantages of an open, extensible and standards-compliant middleware infrastructure.

For example, I needed to implement an “edge detection” mechanism for a POC project. My colleagues at SQLstream recommended using “Bollinger bands” for determining outliers. So, I browsed through the  wikipedia entry for Bollinger Bands to learn more. Bollinger bands are very similar to standard deviations or quartile deviations. A Standard deviation measures variability or dispersion in data distribution. Bollinger bands, on the other hand, provide thresholds to filter outliers in the data. In fact, Bollinger bands are based on the moving average and moving standard deviation of the data set. For typical data sets, Bollinger bands can be defined as:

lowerBB(lower Bollinger Band) = avg – (k * stddev),

upperBB(upper Bollinger Band) = avg + (k* stddev)

where avg and stddev are the average and standard deviation over a sufficiently large time window and k is the constant that needs to be determined for the activity being monitored. For typical data sets, k = 2 will create the upper bollinger band at 95th percentile of the data set.

Bollinger Bands are widely used in the financial services industry. However, Bollinger Bands can be applied to solve problems in other industries. (As I am not claiming to be a statistics expert, I would certainly appreciate honest feedback on our application of Bollinger bands in streaming queries.)

Bollinger bands certainly are a good tool to identify sudden spikes in the activity being monitored in real-time. A number of examples come to my mind,

  • Sudden spikes in the price for a ticker symbol in a stock exchange. For example,

SELECT STREAM ROWTIME, ticker, price,

FROM (SELECT STREAM ROWTIME, ticker, price,

AVG(price) OVER (PARTITION BY ticker RANGE INTERVAL ‘1′ HOUR PRECEDING) AS “avgLastHour”,

STDDEV(price) OVER (PARTITION BY ticker RANGE INTERVAL ‘1′ HOUR PRECEDING) AS “stdDevLastHour”,

AVG(price) OVER (PARTITION BY ticker ROWS 5 PRECEDING) AS “avgLast5Trades”

FROM BIDS) AS S

WHERE S.”avgLast5Trades” > S.”avgLastHour” + 2 * S.”stdDevLastHour”;

  • Spikes in the error rate on a web server. For example,

SELECT STREAM ROWTIME, url, “numErrorsLastMinute”,

FROM (SELECT STREAM ROWTIME, url, “numErrorsLastMinute”,

AVG(“numErrorsLastMinute”) OVER (PARTITION BY url RANGE INTERVAL ‘1′ HOUR PRECEDING) AS “avgErrorsPerMinute”,

STDDEV(“numErrorsLastMinute”) OVER (PARTITION BY url RANGE INTERVAL ‘1′ HOUR PRECEDING) AS “stdDevErrorsPerMinute”

FROM “HttpRequestsPerMinute”) AS S

WHERE S.”numErrorsLastMinute” > S.”avgErrorsPerMinute” + 2 * S.”stdDevErrorsPerMinute”;

  • Monitoring call volumes in a call center.
  • Analytics on social/online gaming services.

In the Stream Computing context, Bollinger bands provide the high/low-water marks for monitoring activity. Whenever the level of recent activity crosses these Bollinger Band thresholds, the activity can be flagged. The streaming analytics engine can then perform additional analytics to detect patterns in the activity and to provide actionable information to regulate the system that is being monitored. At the very least, Bollinger bands can be used to filter out “uninteresting” rows from the stream, thereby reducing the load on the streaming pipeline.

At SQLstream, we used windowed aggregation functions such as AVG() OVER (…) and STDDEV() OVER (…) to establish Bollinger bands. It is necessary to compute AVG and STDDEV on sufficiently large windows of time. In a streaming context, we used sufficiently large windows of time to calculate Bollinger bands. So, as the window slides forward in time, the Bollinger bands reflect more recent activity levels. The current activity levels can then be computed on a much smaller window, potentially including only the current row in the stream. Should the current activity level cross either of the Bollinger bands, we then mark that as a spike in the activity level. The formula for Bollinger bands needs to be changed based on the data distribution, that is, to determine exactly what multiple of standard deviation is appropriate.

Coming back to my point about openness and extensibility, as you can see in the example queries above, you could execute a very similar query in Oracle or SQL server. Key features such as windowed aggregation functions, often called SQL OLAP functions, have been in SQLstream for a long time. Interestingly, SQLstream did not support STDDEV() windowed aggregation function during the POC. A lot of the SQL experts will know STDDEV can be easily rewritten using a formula involving AVG. Our Chief Architect, Julian Hyde, was quick enough to “sweeten” the deal by adding the “syntactic sugar” necessary to support STDDEV natively.

I am sure a lot of you readers have interesting ideas and questions. Please feel free to post them here and I will be happy to engage in conversation.