Can GPS solutions really monitor parolees in real-time?

A recent San Francisco Chronicle article described using GPS anklets to track former gang members on parole, expanding a program first used on paroled sex offenders.

The concept is great: if you know where parolees are, you can make sure they don’t violate parole restrictions (or catch them if they do). But the technology doesn’t always achieve the goal, for the simple reason that someone has to be tracking the anklets.

Technology can also help solve the problem. Standard data warehousing practices can tell authorities if someone violated parole yesterday, but having the information at the moment a violation occurs lets them react quickly and decisively. In some cases, GPS time-and-location data can help prevent crimes as well as solve them after the fact.

Real-time data analytics can do this by monitoring the sensor data as it arrives and sending a page or other alert to parole officers as soon as a violation occurs. At SQLstream, we’ve been working with customers to monitor real-time data from such sensors, providing immediate reports and alerts against pre-determined boundaries of time or space.
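To make "boundaries of time or space" concrete, here is a minimal Python sketch of checking a single GPS fix against an exclusion zone and a curfew. It is an illustration only, not SQLstream code: the `Zone` type, the function names and the coordinates are all hypothetical, and zones are assumed to be simple circles.

```python
import math
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Zone:
    """Hypothetical circular zone: centre in degrees, radius in metres."""
    name: str
    lat: float
    lon: float
    radius_m: float


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def in_curfew(t, start, end):
    """True if time-of-day t falls in [start, end), allowing for a curfew that spans midnight."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def violations(fix_time, lat, lon, exclusion_zones, home, curfew_start, curfew_end):
    """Return a list of violation descriptions for one GPS fix (empty if none)."""
    found = []
    # Spatial boundary: the wearer must stay out of every exclusion zone.
    for zone in exclusion_zones:
        if haversine_m(lat, lon, zone.lat, zone.lon) <= zone.radius_m:
            found.append(f"inside exclusion zone {zone.name}")
    # Temporal boundary: during curfew the wearer must be within the home zone.
    if in_curfew(fix_time.time(), curfew_start, curfew_end):
        if haversine_m(lat, lon, home.lat, home.lon) > home.radius_m:
            found.append("away from home during curfew")
    return found
```

In a real-time deployment a check like this would run on every fix as it arrives, so the alert goes out while the violation is still in progress rather than in the next day's report.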

It’s also a growing market across the globe. In one state in Germany, ankle bracelets for monitoring time or location boundaries assigned to offenders on probation appear effective, helping probationers stick to a regular schedule, among other benefits.

The business case for better technology is driven by the potential for huge cost savings. In Germany, GPS anklet solutions cost only about one-third of what incarceration costs: about 33 euros per day versus about 100 euros for a day in jail (about $44 and $133, respectively). In California, using numbers from the Chronicle article and the California Department of Corrections and Rehabilitation, GPS ankle monitoring costs about $26 per day, while jail time averages about $133.

Such devices are also in use in other American communities, and German State justice ministers were scheduled to meet June 30th to discuss implementing the bracelets in other parts of Germany.

So it looks like GPS anklets for dangerous parolees may be here to stay, and with a step change in the supporting monitoring technology, true real-time analysis and reporting of exceptions and corrective action can be a reality.

Streaming Sensor Data

Railroads have used trackside readers to scan bar codes on the sides of freight cars since the 1970s. Such sensors provided real-time tracking of goods as they made their way from the supplier to the delivery point. Retail businesses have increased their use of RFID tags over the past 20 years to track goods through the manufacturing process. Since the Indian Ocean tsunami of December 2004, the public has become aware of deep-water pressure sensors, which sit on the ocean floor to detect tsunamis and are intended to generate warnings about potential disasters.

The cost of sensors has decreased significantly in recent years, and as a result inexpensive sensors are present nearly everywhere in businesses. As the price of sensors decreases, it becomes economically feasible to deploy thousands and even millions of them. Such sensors cumulatively generate huge volumes of data. Imagine placing a sensor capable of measuring temperature, humidity, sunlight and air pressure within each square kilometer of the state of Iowa to assist farmers in managing crop production. Now imagine each of those 145,743 sensors generating 100 bytes of data every minute, resulting in a data volume of nearly 21GB per day.
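The arithmetic behind that figure is easy to check. The short Python sketch below uses only the numbers given in the text and assumes decimal gigabytes (1 GB = 10^9 bytes).

```python
SENSORS = 145_743            # one sensor per square kilometer, as in the text
BYTES_PER_READING = 100      # payload of each reading
READINGS_PER_DAY = 24 * 60   # one reading per minute

bytes_per_day = SENSORS * BYTES_PER_READING * READINGS_PER_DAY
gb_per_day = bytes_per_day / 1e9  # decimal gigabytes

print(f"{bytes_per_day:,} bytes/day = {gb_per_day:.2f} GB/day")
```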

There is much buzz about Big Data and the challenges of applying traditional database management tools to extract business value from such data. Fortunately, there is a better way: integrating real-time data, as provided by sensors, with stream analytic processing allows timely enterprise decisions in response to changing conditions.

I urge you to read Damian Black’s recent postings on this blog describing the SQLstream approach to “Big Data”.


Structure10 – after the “Big Data” event

Last week I was on a “Big Data” panel at Structure2010, a GigaOm event. As usual, it was very well run, and there was a large throng of Silicon Valley luminaries, ranging from entrepreneurs to venture capitalists, mixed in with some large customers and users of technology. We clearly have moved on a long way from the days when I was told to change my slides and replace the cloud graphic with a box because “clouds are cloudy” (a direct quotation from a tier one venture capitalist, whose identity I will protect to spare him embarrassment).

SQLstream is already the market leader in applying stream computing to Intelligent Transportation Systems, and we also have the opportunity to provide a similar impact to the Cloud Computing Service Monitoring space. It seems we have exactly the perfect solution to provide real-time insights into service usage, bottlenecks, error rates and service level compliance. And you can add regulatory compliance to that list too – from the continuous alerting side to complement the excellent historical solutions that are out there.

From the presentations at the show, it is clear that Cloud Computing has truly come of age. SQLstream uses cloud services for all demonstrations and also in our QA and Engineering processes. We also have customers deploying in the cloud. The latest emerging cloud solutions fill in many of the former technology gaps, allowing seamless integration into or transition from traditional data centers. You can even run your own private clouds leveraging the same APIs available on the public clouds.

On the Big Data front, on the panel alongside SQLstream were a Hadoop vendor and a high-performance column store data warehouse vendor. The other two panelists were users of “big data” technologies. It was interesting to discover that we already had two implementations where SQLstream operates in concert with or in parallel with the other two panelist vendors’ technologies.

There is even a customer (Mozilla) that uses all three technology approaches for download analytics: Hadoop in the form of HBase and a column store data warehouse for historical SQL queries over downloads, and SQLstream to generate high-performance continuous real-time analytics and reporting on download statistics for all versions of Firefox. This clearly demonstrates that there is a role for each of the Big Data technologies highlighted on the panel, and an interesting and growing market opportunity. It also indicates some clear partnership opportunities.

I look forward to seeing the developments in our space and in cloud computing over the coming year and hope to be invited back again soon. We were originally present on the Big Data panel at GigaOm’s inaugural Structure2008 event, so I guess we should be set for a reappearance at Structure2012?! If so, I am sure we will have some exciting new stories to share.

Here is a link to the video recording of the panel session. A big thank-you to Phil Hendrix for his excellent moderation of the panel and the professional preparation work he did beforehand so that the actual event went smoothly.

Big Data – Dealing with the Data Tsunami

There is a lot of buzz these days about the challenge of “Big Data”. I’ll be speaking on the subject at GigaOM’s Structure2010, on the “DEALING WITH THE DATA TSUNAMI: THE BIG DATA” panel. There are many dimensions to the challenges posed by “Big Data”, which I’ve presented here as five separate but related themes.

Speed of data arrival

The first theme is speed.  When a lot of data arrive fast, it is often overlooked that they arrive in raw form and need to be processed or cooked before they can be of any real value. The processing normally comprises cleaning, filtering, aggregating and validating.  Sometimes the data need to be enhanced, normalized or de-normalized.  While there are a number of proprietary ETL tools out there that can help, most people prefer to perform these operations using SQL.  This approach has become known as ELT as the data are Extracted, Loaded and then Transformed (as opposed to Transformed then Loaded).  In the past, this has meant loading raw data into a data warehouse’s staging tables and then performing the ELT with SQL in batches until the data are fully cooked and ready to take part in the “main course” queries. 

One of the strengths of the SQLstream approach is that for the first time you can use standards-based SQL to perform these ELT steps as Continuous ETL, rather than operating upon the data after first storing it. We call this “analyze-before-store” approach Query the Future, because the scope of the continuous queries is from the moment they start until the end of future time (in contrast with historical queries, whose scope is from the moment they start until as far back in time as the data are stored). SQLstream’s queries continuously process, clean, aggregate and enhance the data in a highly parallelized, pipelined dataflow process. The staging is in main memory using 64-bit architecture and multiple cores and servers. This provides a highly scalable, efficient and cost-effective solution to ETL, with the virtuous side-effect of enabling the data warehouse to be kept continuously up-to-date by feeding it a stream of fully cooked data and updating its aggregate tables continuously in near real-time. All of this is done without stealing valuable cycles from the data warehouse server.
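The analyze-before-store idea can be sketched outside SQL as a pipeline of generators. The Python below is only an illustration of the concept, not how SQLstream is implemented: the field names (`ts`, `key`, `value`) are hypothetical, and rows are assumed to arrive in timestamp order.

```python
from collections import defaultdict


def clean(rows):
    """Validate and normalize raw rows as they arrive; silently drop malformed ones."""
    for row in rows:
        try:
            yield {
                "minute": int(row["ts"]) // 60,       # bucket seconds into minutes
                "key": row["key"].strip().lower(),    # normalize the grouping key
                "value": float(row["value"]),
            }
        except (KeyError, ValueError, TypeError, AttributeError):
            continue


def aggregate_per_minute(rows):
    """Emit (minute, key, count, total) as soon as each minute is complete."""
    current = None
    totals = defaultdict(lambda: [0, 0.0])

    def flush(minute):
        for key, (count, total) in sorted(totals.items()):
            yield (minute, key, count, total)
        totals.clear()

    for row in rows:
        if current is not None and row["minute"] != current:
            yield from flush(current)
        current = row["minute"]
        totals[row["key"]][0] += 1
        totals[row["key"]][1] += row["value"]
    if current is not None:
        yield from flush(current)


raw = [
    {"ts": 0, "key": " A ", "value": "1.5"},
    {"ts": 30, "key": "a", "value": "2.5"},
    {"ts": 45, "key": "b", "value": "oops"},   # malformed, dropped by clean()
    {"ts": 61, "key": "a", "value": "1"},
]
cooked = list(aggregate_per_minute(clean(raw)))
```

Only the cooked aggregates would be loaded into the warehouse; the raw rows never need a staging table.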

Data location

The second theme is data location. As with houses, location is very important when it comes to assessing the value (or usefulness) of the data. Location might be spatial or temporal. If you wish to be alerted of a special price for gas at a specific gas station, clearly it is of greater value if you are currently in the immediate vicinity of the gas station. This shows the value of both the location in space and the location in time. In contrast, most data warehouses dumbly store all service data and records without regard to their value.

Clearly, the value of the data in many cases greatly diminishes over time.  Many of the queries that a business might pose are better targeted at current data.  That is particularly true of targeted advertisements, but also when monitoring customer service level, cloud computing infrastructure and the like.  The data are much more valuable when the business is able to take proactive initiative to capitalize on the value – fixing problems or issues before they negatively impact customers, or making that promotion or sale before the customer purchases product or service from a competitor.  SQLstream’s continuous queries are all about focusing analytics where they have the most value by specifying explicit windows of focus for the queries in terms of time, quantity or space.  While many rows can flow into and out of the window of focus for any given query, the window represents the immediate focus of attention.
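A window of focus defined in terms of time can be sketched in a few lines of Python. This is an illustrative stand-in rather than SQLstream's implementation; the `TimeWindow` class is hypothetical and assumes rows arrive in timestamp order.

```python
from collections import deque


class TimeWindow:
    """Keep only the rows whose timestamp lies within the last `seconds` seconds."""

    def __init__(self, seconds):
        self.seconds = seconds
        self.rows = deque()

    def add(self, ts, value):
        """Add a new row, then evict rows that have slid out of the window."""
        self.rows.append((ts, value))
        cutoff = ts - self.seconds
        while self.rows and self.rows[0][0] <= cutoff:
            self.rows.popleft()

    def values(self):
        return [value for _, value in self.rows]


window = TimeWindow(seconds=60)
for ts, value in [(0, 1), (30, 2), (60, 3), (100, 4)]:
    window.add(ts, value)
```

Rows flow in at one end and out at the other, so any aggregate computed over `window.values()` always describes the most recent minute.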

Pace of change

The third theme is the pace of change of data. If you have a large quantity of data that is not changing very much, then historical queries and analysis will no doubt provide you with all of your answers. However, if the data are changing constantly, or a lot of new data is arriving constantly, or if you have a focus on a specific window of time or space, then historical analysis has little value. What you care about is the derivative of the data, that is, its rates of change. For example, are our sales accelerating or decelerating? Is the rate of acceleration unusually high or low? What about service outages and error rates? Or customer complaints? The SQLstream approach enables you to see what is changing rather than what is staying the same. It is analogous to predator vision: predators want to see what is moving, and their vision system prioritizes that over what remains motionless. SQLstream provides such dynamic vision.
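For evenly spaced samples, "accelerating or decelerating" comes down to first and second differences. A small Python sketch, using made-up sales figures purely for illustration:

```python
def differences(values):
    """Change between each consecutive pair of samples."""
    return [b - a for a, b in zip(values, values[1:])]


sales = [100, 110, 125, 145, 150]      # hypothetical sales per period
velocity = differences(sales)          # rate of change per period
acceleration = differences(velocity)   # change in the rate of change

# A negative final acceleration means growth slowed in the latest period.
decelerating = acceleration[-1] < 0
```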

Balancing historical and continuous analysis

The fourth theme is the need to complement data mining and the results of historical analysis with continuous analysis. Data warehousing allows you to find patterns and predictors from past data and to back test all of your hypotheses over extended periods of time. The back testing of such hypotheses often takes the form of SQL queries that search for patterns of changes of data over time and check that the predicted results occurred and with what frequency. Once you have mined and captured such valuable predictors, it is straightforward to take the SQL you have generated and tweak it to be used in real-time, continuously executed against live data. Using this approach, SQLstream allows you to leverage your data mining results to perform real-time predictive analytics, giving your business a real-time heads up for key indicators of buying signals, or system failures, or what ad should be served up based on a customer’s web behavior.

Brain over brawn processing

My fifth and final theme is “smart declarative” versus “dumb brute force” when applied to data queries.  The latter is how I see Hadoop-based approaches.  You parallelize a problem to take advantage of a lot of available servers and related CPU cycles, but you do not rely on any intelligence on how you partition the problem.  In fact not having to “think” is one of the primary appeals of the technique.  It is a brute force method of brawn over brain.  However, where the problem space is truly huge, or the time or financial budget is more limited, there is always the attraction of the “brain over brawn” technique.  Declarative SQL processing draws upon the mathematical tractability of analyzing patterns and dependencies within the data, the use of keys and indexing, the rewriting of complex formulae into simpler ones and avoiding recalculation of intermediate results – in order to provide a faster, more efficient and smarter way of finding the solutions.  Such declarative techniques can still take extensive advantage of parallelism and inexpensive or available servers and CPU cycles, but they rely on smart analysis in order to optimize the calculations.  SQLstream, and all SQL-based data warehouses, heavily draw upon these mathematical SQL properties and patterns and analysis of the data to do the smart thing when it comes to query processing. 

Stream Computing of the kind embodied by SQLstream, however, has even greater potential to take advantage of parallelism than SQL data warehouses, because SQLstream’s Stream Computing has no transactional bottleneck and is purely declarative. Input streams are not “side-effected” by the execution of stream SQL statements; rather, new streams are created from the original ones (which are left untouched and can be presented concurrently to other SQLstream servers). The execution paradigm is one of parallel dataflow execution, a paradigm that lends itself not only to massive parallel execution but also to massively distributed execution. I believe that as Hadoop becomes more widely understood and deployed, people will begin to see just how much of a better job could be performed by adding a little intelligence and just how powerful declarative stream computing can be.

OLAP change notification, and the CellSetListener API

There has been an interesting design discussion on the olap4j forums about how an OLAP server could notify its client that the data set has changed. It is exciting because it would allow us to efficiently update OLAP displays in real-time.

With Mondrian, we came up with an API, at the center of which is the new interface CellSetListener, which I have just checked into olap4j’s subversion repository. (The Mondrian API is experimental. That means you shouldn’t expect to find a working implementation just yet, or assume that the API won’t change radically before it is finalized, but it does mean we are still very much open to suggestions for improvements.)

Of course, OLAP notifications are a subject close to my heart, because they bring together my interests in SQLstream and mondrian. ‘Push-based’ computing is challenging, because every link in the chain needs to propagate the events to the next link. In a previous post I described how SQLstream could do continuous ETL, populate fact and aggregate tables incrementally, and notify mondrian that data items in its cache were out of date.

A mondrian implementation of the CellSetListener API would cause mondrian to internally re-evaluate all queries that have listeners and cover an affected area of the cache. If the results of those queries changed, mondrian would transmit those notifications to OLAP client applications such as Pentaho Analyzer or PAT. The client application would then change the value of the cell on the screen, and maybe change the cell’s background color momentarily to attract the user’s attention.
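The notification flow described above can be sketched with a plain observer pattern. To be clear, the Python below is not the olap4j `CellSetListener` API (which is a Java interface still under design) and not mondrian code; every name in it is hypothetical, and a "query" is reduced to a function over a dictionary of facts.

```python
class QueryCache:
    """Toy stand-in for an OLAP cache that notifies listeners when results change."""

    def __init__(self, facts):
        self.facts = dict(facts)   # e.g. region -> sales total
        self.listeners = []        # entries: [covered_regions, query, callback, last_result]

    def add_listener(self, covered_regions, query, callback):
        """Register a query, the part of the cache it covers, and a callback."""
        self.listeners.append([set(covered_regions), query, callback, query(self.facts)])

    def data_changed(self, region, value):
        """Called when upstream data changes; re-evaluate only the affected queries."""
        self.facts[region] = value
        for entry in self.listeners:
            covered, query, callback, last = entry
            if region not in covered:
                continue               # this query does not cover the changed area
            result = query(self.facts)
            if result != last:         # notify only if the visible result changed
                entry[3] = result
                callback(last, result)


events = []
cache = QueryCache({"west": 10, "east": 5})
cache.add_listener({"west"}, lambda facts: facts["west"],
                   lambda old, new: events.append((old, new)))
cache.data_changed("east", 7)    # not covered by the listener: no event
cache.data_changed("west", 12)   # covered and changed: one event
cache.data_changed("west", 12)   # covered but unchanged: no event
```

A client application would use the callback to repaint the affected cell, which is the step that makes the change visible to the end user.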

Getting data to change on the screen, in front of the end-user’s eyes, within seconds of the data changing in the operational system, would be truly spectacular.

There are several links in the chain to make that happen. Two of the links, SQLstream and mondrian’s cache control API, are already complete. We’ve just begun forging the next link.

This post was originally published on my personal Open Source Olap blog, but has been posted here as it is relevant, and hopefully of interest, to SQLstream users interested in our real-time OLAP solutions.

Streaming SQL and Bollinger Bands

The last year has been an interesting experience as I participated in a number of customer “Proof Of Concept” projects for SQLstream. Developing these real-time, stream computing projects greatly increased my appreciation for the advantages of an open, extensible and standards-compliant middleware infrastructure.

For example, I needed to implement an “edge detection” mechanism for a POC project. My colleagues at SQLstream recommended using “Bollinger bands” for determining outliers. So, I browsed through the Wikipedia entry for Bollinger Bands to learn more. Bollinger bands are closely related to standard deviations or quartile deviations. A standard deviation measures variability or dispersion in a data distribution. Bollinger bands, on the other hand, provide thresholds to filter outliers in the data. In fact, Bollinger bands are based on the moving average and moving standard deviation of the data set. For typical data sets, Bollinger bands can be defined as:

lowerBB (lower Bollinger Band) = avg - (k * stddev)

upperBB (upper Bollinger Band) = avg + (k * stddev)

where avg and stddev are the average and standard deviation over a sufficiently large time window and k is a constant that needs to be determined for the activity being monitored. For approximately normally distributed data, k = 2 places the two bands so that roughly 95% of values fall between them; the upper band on its own sits near the 97.7th percentile.

Bollinger Bands are widely used in the financial services industry. However, Bollinger Bands can be applied to solve problems in other industries. (As I am not claiming to be a statistics expert, I would certainly appreciate honest feedback on our application of Bollinger bands in streaming queries.)

Bollinger bands certainly are a good tool to identify sudden spikes in the activity being monitored in real-time. A number of examples come to mind:

  • Sudden spikes in the price for a ticker symbol in a stock exchange. For example,

SELECT STREAM ROWTIME, ticker, price
FROM (SELECT STREAM ROWTIME, ticker, price,
        -- long window: baseline average and standard deviation per ticker
        AVG(price) OVER (PARTITION BY ticker RANGE INTERVAL '1' HOUR PRECEDING) AS "avgLastHour",
        STDDEV(price) OVER (PARTITION BY ticker RANGE INTERVAL '1' HOUR PRECEDING) AS "stdDevLastHour",
        -- short window: recent activity level
        AVG(price) OVER (PARTITION BY ticker ROWS 5 PRECEDING) AS "avgLast5Trades"
      FROM BIDS) AS S
-- keep only rows where recent activity exceeds the upper Bollinger band (k = 2)
WHERE S."avgLast5Trades" > S."avgLastHour" + 2 * S."stdDevLastHour";

  • Spikes in the error rate on a web server. For example,

SELECT STREAM ROWTIME, url, "numErrorsLastMinute"
FROM (SELECT STREAM ROWTIME, url, "numErrorsLastMinute",
        -- long window: baseline error rate and its variability per URL
        AVG("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' HOUR PRECEDING) AS "avgErrorsPerMinute",
        STDDEV("numErrorsLastMinute") OVER (PARTITION BY url RANGE INTERVAL '1' HOUR PRECEDING) AS "stdDevErrorsPerMinute"
      FROM "HttpRequestsPerMinute") AS S
-- keep only rows where the current error count exceeds the upper Bollinger band (k = 2)
WHERE S."numErrorsLastMinute" > S."avgErrorsPerMinute" + 2 * S."stdDevErrorsPerMinute";

  • Monitoring call volumes in a call center.
  • Analytics on social/online gaming services.

In the Stream Computing context, Bollinger bands provide the high/low-water marks for monitoring activity. Whenever the level of recent activity crosses these Bollinger Band thresholds, the activity can be flagged. The streaming analytics engine can then perform additional analytics to detect patterns in the activity and to provide actionable information to regulate the system that is being monitored. At the very least, Bollinger bands can be used to filter out “uninteresting” rows from the stream, thereby reducing the load on the streaming pipeline.

At SQLstream, we used windowed aggregation functions such as AVG() OVER (…) and STDDEV() OVER (…) to establish Bollinger bands. AVG and STDDEV need to be computed over sufficiently large windows of time so that the bands represent a stable baseline. As the window slides forward in time, the Bollinger bands reflect more recent activity levels. The current activity level can then be computed on a much smaller window, potentially including only the current row in the stream. Should the current activity level cross either of the Bollinger bands, we mark that as a spike in the activity level. The multiplier in the Bollinger band formula needs to be tuned to the data distribution, that is, to determine exactly what multiple of the standard deviation is appropriate.
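For readers who want to experiment with the idea outside a streaming SQL engine, here is a minimal Python sketch of the same logic. It is an approximation under stated assumptions, not a port of the queries above: the baseline window is a fixed number of rows rather than a time interval, it excludes the current row, it uses the population standard deviation, and the function name and sample data are made up.

```python
import statistics
from collections import deque


def bollinger_spikes(values, window=20, k=2.0):
    """Yield (index, value, lower, upper) for each value outside the bands.

    The bands are computed from the `window` values preceding the current one,
    so nothing is flagged until the baseline window has filled.
    """
    history = deque(maxlen=window)
    for index, value in enumerate(values):
        if len(history) == window:
            avg = statistics.fmean(history)
            stddev = statistics.pstdev(history)
            lower = avg - k * stddev
            upper = avg + k * stddev
            if value < lower or value > upper:
                yield (index, value, lower, upper)
        history.append(value)  # the window slides forward one row


readings = [10, 11, 9, 10, 11, 9, 10, 11, 9, 10, 30, 10]
spikes = list(bollinger_spikes(readings, window=10, k=2.0))
```

Note that once a spike enters the baseline window it widens the bands for a while, which is one reason the window size and k need tuning for each data set.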

Coming back to my point about openness and extensibility, as you can see in the example queries above, you could execute a very similar query in Oracle or SQL Server. Key features such as windowed aggregation functions, often called SQL OLAP functions, have been in SQLstream for a long time. Interestingly, SQLstream did not support the STDDEV() windowed aggregation function during the POC. Many SQL experts will know that STDDEV can be easily rewritten using a formula involving AVG. Our Chief Architect, Julian Hyde, was quick enough to “sweeten” the deal by adding the “syntactic sugar” necessary to support STDDEV natively.
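The rewrite in question is presumably the familiar identity stddev = sqrt(avg(x * x) - avg(x) * avg(x)), which gives the population standard deviation. I am assuming that form here; the post does not say exactly which one was used during the POC. A quick Python check of the identity:

```python
import math
import statistics


def stddev_from_averages(values):
    """Population standard deviation expressed only in terms of averages."""
    n = len(values)
    avg = sum(values) / n
    avg_of_squares = sum(v * v for v in values) / n
    # Clamp at zero: rounding can make the difference slightly negative.
    return math.sqrt(max(avg_of_squares - avg * avg, 0.0))


sample = [2, 4, 4, 4, 5, 5, 7, 9]
result = stddev_from_averages(sample)
```

The identity is convenient because both averages can be maintained over the same sliding window, although it can lose precision when the values are large relative to their spread.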

I am sure many of you have interesting ideas and questions. Please feel free to post them here and I will be happy to engage in conversation.

Intelligent Transportation and the ITSA Conference

I am just back from the 2010 Intelligent Transportation Society of America’s Annual Meeting. For those unfamiliar with intelligent transportation, I am not referring to the “shovel ready” projects that have been funded by President Obama as part of the economic stimulus package. These projects were designed to spend money and create jobs, thereby stimulating the economy. Unlike the federal “shovel ready” projects, “network ready” intelligent transportation technologies and projects are rapidly being adopted and implemented by local and state departments of transportation that must still operate under fixed or reduced budgets. These local and state DOTs are using new technologies to “Do More with Less.”

Intelligent Transportation aims to reduce costs, delays, pollution, injuries and deaths by connecting infrastructure control and monitoring systems to the network and enabling these systems, and their operators, to communicate in real-time. Some examples of intelligent transportation solutions and control systems include dynamic speed limits that change according to traffic and road conditions, stop lights that know when you can go and the FasTrak electronic toll system that reduces congestion on the Golden Gate Bridge and other Bay Area bridges. Real-time technology is essential if these dynamic control systems are to collect your toll at 45 miles per hour or detect when it is safe to proceed through an intersection.

All of these intelligent transportation systems and devices can be thought of as “sensors” on the network. The data is collected by the sensors, streamed to a server, analyzed and eventually stored in a warehouse. (Imagine the final scene from Raiders of The Lost Ark, except with crates full of hard drives). Meanwhile, the analytic results are communicated back to the original sources (stop lights, toll booths and electronic road information signs) as well as to the mobile devices in your vehicle.

In some cases, new intelligent transportation solutions need to be integrated with legacy systems. In other cases, they simply need to be able to talk to each other. Thus, it becomes imperative that all new intelligent transportation solutions be built on a set of common, open standards. In the long run, solutions built on open standards reduce the total costs to those who implement and maintain the solutions. Open standards, and in particular the global use of open data standards, are essential within the intelligent transportation industry, not just so that different sensors on the network and IT solutions can communicate with each other, but so that drivers can experience consistent and safe journeys as they cross from federal highways to state and local roads, always in contact with the intelligent transportation systems that control these roads.