Blog

March 23, 2011

Since Tuesday’s announcement that the Firefox Download Monitor is powered by SQLstream, we’ve received a number of questions about how it all fits together. I hope this description helps answer some of those questions.

Mozilla Firefox Real-Time Download Monitor - Day 2

SQLstream server executes SQL statements, just like standard SQL, except the SQLstream’s queries run continuously, analyzing input data in real-time as it arrives. Statements are presented via JDBC or user friendly tools which use JDBC internally. Statements are compiled/prepared, the planner/optimizer chooses an access plan, and a runtime engine executes the plan. SQLstream is compliant with SQL 2008 and 2003 with just a couple of extensions. One extension includes the keyword STREAM as part of a SELECT statement. The STREAM keyword indicates that the results are continuously streaming rather than a point in time TABLE.

Applications in SQLstream are constructed out of a set of SQL CREATE STREAM statements and SQL VIEWS against streams and other views. Those statements are assembled into a pipeline. When describing a pipeline we refer to statements on the source side as being upstream, and statements closer to the destination as being downstream.

In the middle of the pipeline, we define a stream named FirefoxDownloadStream_ which contains the results of the parsed and conditioned download events. The stream declaration is identical to a table definition with the exception of the type of the object being a STREAM rather than TABLE.

CREATE STREAM "FirefoxDownloadStream_" (
+++"download_type"++++++++++VARCHAR(15),
+++"utc_timestamp"++++++++++TIMESTAMP,
+++"product_name"+++++++++++VARCHAR(12),
+++"product_version"++++++++VARCHAR(12),
+++"product_major_version"++VARCHAR(12),
+++"product_os"+++++++++++++VARCHAR(10),
+++"locale_code"++++++++++++VARCHAR(5),
+++"country_code"+++++++++++VARCHAR(2),
+++"city_name"++++++++++++++VARCHAR(32),
+++"region_code"++++++++++++VARCHAR(2),
+++"longitude"++++++++++++++VARCHAR(8),
+++"latitude"+++++++++++++++VARCHAR(8)
);

The stream is populated with a SQL INSERT-SELECT statement. Again, standard SQL statements are used. The WHERE clause defines that the downloads include new first time downloads, complete upgrades of prior versions of Firefox, or partial upgrades of prior versions of Firefox.

INSERT INTO "FirefoxDownloadStream_"
++("download_type",
+++"utc_timestamp",
+++"product_name",
+++"product_version",
+++"product_major_version",
+++"product_os",
+++"locale_code",
+++"country_code",
+++"city_name",
+++"region_code",
+++"longitude",
+++"latitude"
++)
SELECT STREAM
+++"dlType"+++AS "download_type",
+++"dlTime"+++AS "utc_timestamp",
+++"product"++AS "product_name",
+++"version"++AS "product_version",
+++"GetMajorVersion"("version") AS "product_major_version",
+++"os"+++++++AS "product_os",
+++"lang",++++AS "locale_code",
+++"cc",++++++AS "country_code",
+++"city",++++AS "city_name",
+++"rg",++++++AS "region_code",
+++CAST("latitude" AS VARCHAR(10)) AS "latitude",
+++CAST("longitude" AS VARCHAR(10)) AS "longitude"
FROM "FirefoxCountryFilter"
WHERE (("dlType" IS NULL) OR ("dlType" = 'complete') OR ("dlType" = 'partial'));

The download events contain the time of each download. Mozilla has a number of download servers feeding the worldwide requests to download Firefox. Each of these servers feeds the results of the download requests to a common logfile which is “tailed” by SQLstream. As the time for each download differs due to each client’s network capacity, the download requests may be slightly out of order. In practice the biggest gap we’ve seen is 4 seconds. Since we’re measuring downloads over the long period of time, it was deemed sufficient to adjust the download time of late arrivals to match the most recent download time.
The following SQL statement does that adjustment.


CREATE OR REPLACE VIEW "FirefoxDownloadStream" AS
+++SELECT STREAM MAX("utc_timestamp") OVER(ROWS UNBOUNDED PRECEDING)
++++++++++AS ROWTIME,
++++++++++*
+++FROM "FirefoxDownloadStream_";

SQLstream associates a ROWTIME with each row in a STREAM. The ROWTIME is a monotonically increasing SQL timestamp. In the default case, the ROWTIME is the current time expressed in UTC. Most applications require time to be defined according to time associated with the data itself. Associating the ROWTIME of a row in a stream based on the data contents of the row, is done by the AS ROWTIME clause for an individual column. In the Mozilla pipeline, we set the ROWTIME to be the maximum of the values in the “utc_timestamp” column to be that rows ROWTIME.
The analytics portion of the pipeline is implemented with a standard SQL statement. For example, each 10 seconds the number of downloads for each product, version, … country, city, region is calculated.


CREATE OR REPLACE VIEW "FirefoxStreamForLocationCounters"
DESCRIPTION 'Compute product counters for a minute' AS
+++SELECT STREAM
++++++++++"download_type",
++++++++++"product_name",
++++++++++"product_major_version",
++++++++++"product_version",
++++++++++"country_code",
++++++++++"region_code",
++++++++++"city_name",
++++++++++"latitude",
++++++++++"longitude",
++++++++++count(*) AS "count"
+++FROM "FirefoxDownloadStream" F
+++GROUP BY FLOOR(F.ROWTIME TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '10' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '20' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '30' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '40' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '50' SECOND TO MINUTE),
++++++++++++"product_name",
++++++++++++"download_type",
++++++++++++"product_major_version",
++++++++++++"product_version",
++++++++++++"country_code",
++++++++++++"region_code",
++++++++++++"city_name",
++++++++++++"latitude",
++++++++++++"longitude";

There is a similar view declaration where similar calculations are done for each product. Most of the interest since Tuesday is of course related to Firefox 4.0 downloads. This second view allows Mozilla to drill down on downloads by platform as well as downloads for previous (and future) Firefox versions.


CREATE OR REPLACE VIEW "FirefoxStreamForProductCounters"
DESCRIPTION 'Compute product counters for a minute' AS
+++SELECT STREAM
++++++++++"download_type",
++++++++++"product_name",
++++++++++"product_major_version",
++++++++++"product_version",
++++++++++"product_os",
++++++++++count(*) AS "count"
+++FROM "FirefoxDownloadStream" F
+++GROUP BY FLOOR(F.ROWTIME TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '10' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '20' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '30' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '40' SECOND TO MINUTE),
++++++++++++FLOOR(F.ROWTIME - INTERVAL '50' SECOND TO MINUTE),
++++++++++++"product_name",
++++++++++++"download_type",
++++++++++++"product_major_version",
++++++++++++"product_version",
++++++++++++"product_os";

Each of these views (FirefoxStreamForLocationCounters and FirefoxStreamForProductCounters) is based on the FirefoxDownloadStream. Each defined stream and view is a point where the application can access data either directly or via another VIEW or INSERT…SELECT.

One component of the solution is a piece of code we call the HBaseAgent. The agent uses the JDBC interface to SQLstream and issues a SELECT * FROM each of the described views containing the location and product counter 10-second download counts. The HBaseAgent maps each fetched row to the HBase schema as defined by Mozilla.

I write this blog about 24 hours after Firefox 4 launched. So far there are more than 8 million downloads of Firefox 4. It certainly has been an exciting day for Mozilla and I congratulate everyone who contributed. I’m happy that SQLstream has been able to contribute to their success.

March 22, 2011

SQLstream has been powering Mozilla’s Firefox Download Monitor since 2009. A SQLstream based application has been continually aggregating hundreds of millions of download events, receiving minute by minute aggregations via a continuously running SQL SELECT statement using the SQLstream JDBC driver. A continuously running SELECT statement is syntactically and semantically identical to other SELECT statements with the addition that end of data is never returned in SQLSTATE by the FETCH associated with the cursor.

Mozilla 4.0 Real-Time Download Monitor is Powered By SQLstream

For the launch of Firefox 4, Mozilla again turned to SQLstream to enhance the download monitor. Applications in SQLstream are built by defining a series of SQL stream definitions and SQL views. Business rules are embedded in these definitions. Definitions are assembled into a pipeline and each definition provides a point where data is available to applications. A stream definition is analogous to a SQL table definition and contains the column names and data types for each defined element. Each stream has an implicit ROWTIME column, a monotonically increasing value associated with the data in each column.

The download monitor tails the log files written by all of Mozilla’s download servers which provide new versions of Firefox. Each entry in the log is parsed. The IP address of each download is converted to a country, city, region, latitude, and longitude. For the Firefox 4 release, Mozilla wanted the results of the download aggregation to be stored in an HBase table in their HBase/Hadoop cluster. Storing the data allows future historical analysis to complement the realtime analysis provided by SQLstream.

SQLstream aggregates downloads to two separate column families in a single HBase table. The ‘product’ column family contains the overall download count. The ‘location’ column family contains the count of downloads for each country, region, city, latitude, longitude.

SQLstream uses a GROUP BY clause along with a COUNT(*) to calculate the number of downloads for each 10 seconds. The author wrote a new piece of SQLstream code which provides an interface from SQLstream to HBase. The HBaseAgent maps the results of the GROUP BY and calls the HBase API to persist data in HBase. The incrementColumnValue API is key in that it allows SQLstream to aggregate download counts on a realtime basis and efficiently update HBase by providing incremental values.

The application periodically reads data from HBase and sends formatted data to each connected browser. See for yourself at http://glow.mozilla.org/. The map provides a running count of Mozilla Firefox 4.0 download with raindrops on the map indicating each location where one or more downloads have just occurred. As I write this blog entry, I can see that Europe and Asia are hot while North America is just waking up. Clicking on the colored rings in the lower left hand corner of the map, allows drill down to the geographic locations.

Mozilla’s Daniel Einspanjer has also blogged about their new real-time download vizualization application. The blog explains the overall architecture of the real-time application using SQLstream, and the SQLstream integration with HBase.

You can read more about the previous Firefox 3 download monitor on Julian Hyde’s blog. Julian is the CTO of SQLstream.

Mozilla also blogged about the history of the Firefox 3 download monitor on the Mozilla Webdev blog.

July 29, 2010

Railroads have used track side readers to scan bar codes on the sides of freight cars since the 1970s. Such sensors provided real time tracking of goods as they made their way from the supplier to the delivery point. Retail businesses increased the use of RFID tags in the past 20 years to track goods through the manufacturing process. Since the Indian Ocean tsunami of December 2004 the public has become aware of deep water pressure sensors which sit on the ocean floor to detect tsunamis and are intended to generate warnings about potential disasters.

The cost of sensors has decreased significantly in recent years and as a result inexpensive sensors are present nearly everywhere in businesses. As the price of sensors decreases it becomes economically feasible to deploy thousands and even millions of sensors. Such sensors cumulatively generated huge volumes of data. Imagine placing a sensor capable of measuring temperature, humidity, sun light and air pressure sensor within each square kilometer in the state of Iowa to assist farmers in managing crop production. Now imagine each of those 145,743 sensors generating 100 bytes of data every minute resulting in a data volume of nearly 21GB per day.

There is much buzz about Big Data and the challenges of applying traditional database management tools to extract business value from such data. Fortunately, there is a better way – integrating real time data, as provided by sensors, with stream analytic processing, allows timely enterprise decisions in response to changing conditions.

I urge you to read Damian Black’s recent postings on this blog describing the SQLstream approach to “Big Data”.

(more…)