SQLstream has been powering Mozilla’s Firefox Download Monitor since 2009. A SQLstream based application has been continually aggregating hundreds of millions of download events, receiving minute by minute aggregations via a continuously running SQL SELECT statement using the SQLstream JDBC driver. A continuously running SELECT statement is syntactically and semantically identical to other SELECT statements with the addition that end of data is never returned in SQLSTATE by the FETCH associated with the cursor.
For the launch of Firefox 4, Mozilla again turned to SQLstream to enhance the download monitor. Applications in SQLstream are built by defining a series of SQL stream definitions and SQL views. Business rules are embedded in these definitions. Definitions are assembled into a pipeline and each definition provides a point where data is available to applications. A stream definition is analogous to a SQL table definition and contains the column names and data types for each defined element. Each stream has an implicit ROWTIME column, a monotonically increasing value associated with the data in each column.
The download monitor tails the log files written by all of Mozilla’s download servers which provide new versions of Firefox. Each entry in the log is parsed. The IP address of each download is converted to a country, city, region, latitude, and longitude. For the Firefox 4 release, Mozilla wanted the results of the download aggregation to be stored in an HBase table in their HBase/Hadoop cluster. Storing the data allows future historical analysis to complement the realtime analysis provided by SQLstream.
SQLstream aggregates downloads to two separate column families in a single HBase table. The ‘product’ column family contains the overall download count. The ‘location’ column family contains the count of downloads for each country, region, city, latitude, longitude.
SQLstream uses a GROUP BY clause along with a COUNT(*) to calculate the number of downloads for each 10 seconds. The author wrote a new piece of SQLstream code which provides an interface from SQLstream to HBase. The HBaseAgent maps the results of the GROUP BY and calls the HBase API to persist data in HBase. The incrementColumnValue API is key in that it allows SQLstream to aggregate download counts on a realtime basis and efficiently update HBase by providing incremental values.
The application periodically reads data from HBase and sends formatted data to each connected browser. See for yourself at http://glow.mozilla.org/. The map provides a running count of Mozilla Firefox 4.0 download with raindrops on the map indicating each location where one or more downloads have just occurred. As I write this blog entry, I can see that Europe and Asia are hot while North America is just waking up. Clicking on the colored rings in the lower left hand corner of the map, allows drill down to the geographic locations.
Mozilla’s Daniel Einspanjer has also blogged about their new real-time download vizualization application. The blog explains the overall architecture of the real-time application using SQLstream, and the SQLstream integration with HBase.
You can read more about the previous Firefox 3 download monitor on Julian Hyde’s blog. Julian is the CTO of SQLstream.
Mozilla also blogged about the history of the Firefox 3 download monitor on the Mozilla Webdev blog.