Archive for the ‘big data’ Category

Streaming Big Data – a major trend for 2012

Wednesday, January 11th, 2012

For Big Data, 2012 has started where 2011 left off, with a plethora of reports, articles and blogs. Interestingly, most still begin with the question “what is Big Data”. It appears ‘Big Data’ as a market is broadening its footprint far beyond its open source and Hadoop origins. My favourite new term in this quest for delineation is “Small Big Data”. (Isn’t that just “Data”?)

The most interesting trend for us is streaming Big Data processing and analytics. Edd Dumbill of O’Reilly Radar lists this as one of his “Five big data predictions for 2012”: “Hadoop’s batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn’t need to be up-to-the-minute. However, batch processing isn’t always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.”

The real-time use case is an obvious one. If you need to respond or be warned in real-time or near real-time, for example to a security breach or to a service-impacting event on a VoIP or video call, batch-oriented data stores such as Hadoop are not sufficient because of their high initial latency.

However, there is also an emerging discussion on the storage of Big Data for big data’s sake. This is the blind collection and storage of data without due consideration as to how it’s going to be used. Dan Woods talks about this in his recent Forbes article “Curing the Big Data Storage Fetish”. The data will never create value without analysis, and little thought has been given to increasing analytics capacity.

There are many vendors emerging for the historical analysis of Big Data repositories, either on the Hadoop platform, or on platforms from the other large-scale data warehouse vendors. However, there are very few vendors in the streaming Big Data analytics space, and even fewer products with the maturity, flexibility and scalability to process Big Data streams in real-time.

Streaming Big Data analytics needs to address two areas. The first is the obvious use case: monitoring across all input data streams for business exceptions in real-time. This is a given. But perhaps more importantly, much of the data held in Big Data repositories is of little or no business value, and will never end up in a management report. Sensor networks, IP telecommunications networks, even data center log file processing are all examples where a vast amount of ‘business as usual’ data is generated. It’s therefore important to understand what’s being stored, and only persist what’s important (which admittedly, in some cases, may be everything). For many applications, streaming data can be filtered and aggregated prior to storing, significantly reducing the Big Data burden, and significantly enhancing the business value of the stored data. At least until we understand why we’re trying to store everything.
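As a rough illustration of filtering and aggregating before storing, here is a minimal Python sketch. It is not SQLstream code: the function name `reduce_before_store`, the `(timestamp, sensor_id, value)` row shape, the thresholds and the 60-second bucket are all assumptions made for the example. Out-of-range readings are passed through immediately as exceptions, and the ‘business as usual’ readings are collapsed into one summary row per sensor per time bucket.

```python
def reduce_before_store(readings, low, high, bucket_seconds=60):
    """Reduce a time-ordered stream of (timestamp, sensor_id, value) rows.

    Yields ("exception", ts, sensor, value) as soon as a value falls outside
    [low, high], and ("summary", bucket_start, sensor, count, mean) once per
    sensor when a time bucket closes. Only these rows would be persisted.
    """
    current_bucket = None
    stats = {}  # sensor_id -> [count, running total] for the open bucket
    for ts, sensor, value in readings:
        bucket = int(ts // bucket_seconds)
        if current_bucket is not None and bucket != current_bucket:
            # The stream has moved into a new bucket: emit the old summaries.
            yield from _flush(current_bucket, stats, bucket_seconds)
            stats = {}
        current_bucket = bucket
        if not (low <= value <= high):
            yield ("exception", ts, sensor, value)
        entry = stats.setdefault(sensor, [0, 0.0])
        entry[0] += 1
        entry[1] += value
    if current_bucket is not None:
        # End of input: emit whatever is left in the last bucket.
        yield from _flush(current_bucket, stats, bucket_seconds)


def _flush(bucket, stats, bucket_seconds):
    """Emit one summary row per sensor, in sensor-id order."""
    for sensor in sorted(stats):
        count, total = stats[sensor]
        yield ("summary", bucket * bucket_seconds, sensor, count, total / count)


if __name__ == "__main__":
    sample = [(0, "a", 10.0), (30, "a", 20.0), (45, "b", 99.0), (61, "a", 12.0)]
    for row in reduce_before_store(sample, low=0.0, high=50.0):
        print(row)
```

With a real feed, thousands of in-range readings per minute would shrink to one summary row per sensor, while the exceptions still arrive without waiting for the bucket to close.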

SQLstream 2.5 – Real-time stream computing eliminates Big Data performance and storage bottlenecks

Monday, February 21st, 2011

With service and sensor data growing at 60% CAGR, having both the raw power and correct architecture for processing streaming data is essential. IDC recently released estimates for the size of the ‘Digital Universe’ – a term used to describe every electronically stored piece of data. According to IDC, stored data will reach 1.8 million petabytes (1800 exabytes) by the end of 2011.

Data overload (source IDC)

As a recent article in the Economist points out, all of this data raises significant processing performance and storage issues. Conventional database technology requires data to be stored, cleaned and aggregated before being queried. With the volume of data growing so quickly, it has become cost prohibitive and technologically infeasible to process all data using conventional solutions.

But how much of the raw data actually needs to be stored? The value of individual data is often low, and the useful lifetime of the raw data short. However, the information content is potentially high – it’s just a matter of identifying the valuable information in the raw data.

Introducing SQLstream Server 2.5

For SQLstream, this is the future of data processing – real-time, continuous analysis of streaming data – generate operational business intelligence from live streaming data without first storing the data in a database.

For the latest release of SQLstream Server, SQLstream 2.5, we’ve focussed on the common business requirements for the rapid adoption of real-time stream computing across all markets – performance, reliability and scalability. More specifically, SQLstream 2.5 offers:

  • 10X performance improvement, benchmarked against live operational deployments on a single server installation.
  • Scalability for mission critical applications with federated installations across multiple servers.
  • Business critical reliability following an exhaustive stability and operational optimization program.

Of course, we’ve also addressed a range of important requirements across our customer base, in particular, additional input and output connectors built on the SQL/MED standard for integration, including:

  • enhanced database insert/update/select Adapters.
  • enterprise messaging integration using AMQP.
  • enhanced Log File management and XML feed processing Adapters.

And last but by no means least, supporting the SQL:2008 standards-based streaming SQL language with new functions including:

  • support for GROUP BY ORDER BY.
  • new and enhanced data analysis functions for detecting unique events, such as early emit SELECT DISTINCT.
  • support for the SQL HAVING clause.
  • and a new range of streaming statistical functions for calculating variance and standard deviation.
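To give a feel for what a streaming statistical function has to do, here is a small Python sketch of running variance and standard deviation using Welford's online algorithm, which updates its result one value at a time without keeping the values. This is only an illustration of the general technique; the class name `RunningStats` is hypothetical, and nothing here is taken from the SQLstream implementation.

```python
import math


class RunningStats:
    """Running mean, sample variance and standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the statistics in constant time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have arrived."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def stddev(self):
        return math.sqrt(self.variance)


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2, 4, 4, 4, 5, 5, 7, 9]:
        stats.update(value)
        print(value, stats.mean, stats.stddev)
```

Because each update touches only three numbers, the same statistics can be kept per sensor, per customer or per window without storing the underlying rows.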

Most existing customers have already upgraded to SQLstream 2.5. Some examples of recent SQLstream 2.5 upgrades include customers in the following markets:

Intelligent Transportation – real-time analytics for the intelligent transportation market. A case study for SQLstream ITS Insight was featured recently in ITS International magazine, and an overview of the product’s features can be found on www.sqlstream.com/Products/itsinsight.

Environmental monitoring and event detection – integrating with AMQP, which provides the guaranteed delivery of real-time raw data from a large sensor network, SQLstream filters the raw sensor data (using windowed aggregation) and applies event detection patterns in real-time, generating a continuous stream of environmental exception events.

Social gaming infrastructure – working with a new entrant in the on-line social gaming market, SQLstream monitors user activity and provides continuous real-time scoring updates – including real-time incremental updates of historical, aggregated game data maintained in a back-end data warehouse.

SQLstream 2.5 is available now

But if you’d like to learn more about the Business Case for Streaming SQL, try our “Concepts in Streaming SQL” mini-white paper, or follow this blog over the next couple of months where we’ll be posting how some of these functions are being used to solve real-world problems across our customer base.

Structure10 – after the “Big Data” event

Thursday, July 1st, 2010

Last week I was on a panel for “Big Data” at Structure2010 – a GigaOm event. As usual, it was very well run and there was a large throng of Silicon Valley luminaries ranging from entrepreneurs to venture capitalists scattered in with some large customers and users of technology. We clearly have moved on a long way from the days when I was told to change my slides and remove the cloud graphic and replace it with a box because “clouds are cloudy” (direct quotation from a tier one venture capitalist – I wish to protect his identity to avoid personal embarrassment).

SQLstream is already the market leader in applying stream computing to Intelligent Transportation Systems, and we also have the opportunity to provide a similar impact to the Cloud Computing Service Monitoring space. It seems we have exactly the perfect solution to provide real-time insights into service usage, bottlenecks, error rates and service level compliance. And you can add regulatory compliance to that list too – from the continuous alerting side to complement the excellent historical solutions that are out there.

From the presentations at the show, it is clear that Cloud Computing has truly come of age. SQLstream uses cloud services for all demonstrations and also in our QA and Engineering processes. We also have customers deploying in the cloud. The latest emerging cloud solutions fill in many of the former technology gaps, allowing seamless integration into or transition from traditional data centers. You can even run your own private clouds leveraging the same APIs available on the public clouds.

On the Big Data front, on the panel alongside SQLstream were a Hadoop vendor and a high-performance column store data warehouse vendor. The other two panelists were users of “big data” technologies. It was interesting to discover that we already had two implementations where SQLstream operates in concert with or in parallel with the other two panelist vendors’ technologies.

There is even a customer (Mozilla) that uses all three technology approaches for download analytics – Hadoop in the form of HBase and a column store data warehouse for historical SQL queries over downloads, and SQLstream to generate high-performance continuous real-time analytics and reporting on download statistics for all versions of Firefox. This clearly demonstrates that there is a role for each of the Big Data technologies highlighted on the panel, and an interesting and growing market opportunity. It also indicates some clear partnership opportunities.

I look forward to seeing the developments in our space and in cloud computing over the coming year and hope to be invited back again soon. We were originally present on the Big Data panel at GigaOm’s inaugural Structure2008 event, so I guess we should be set for a reappearance at Structure2012?! If so, I am sure we will have some exciting new stories to share.

Here is a link to the video recording of the panel session. A big thank-you to Phil Hendrix for his excellent moderation of the panel and the professional preparation work he did beforehand so that the actual event went smoothly.

Big Data – Dealing with the Data Tsunami

Tuesday, June 22nd, 2010

There is a lot of buzz these days about the challenge of “Big Data”. I’ll be speaking on the subject at GigaOM’s Structure2010, on the “DEALING WITH THE DATA TSUNAMI: THE BIG DATA” panel. There are many dimensions to the challenges posed by “Big Data”, which I’ve presented here as five separate but related themes.

Speed of data arrival

The first theme is speed.  When a lot of data arrive fast, it is often overlooked that they arrive in raw form and need to be processed or cooked before they can be of any real value. The processing normally comprises cleaning, filtering, aggregating and validating.  Sometimes the data need to be enhanced, normalized or de-normalized.  While there are a number of proprietary ETL tools out there that can help, most people prefer to perform these operations using SQL.  This approach has become known as ELT as the data are Extracted, Loaded and then Transformed (as opposed to Transformed then Loaded).  In the past, this has meant loading raw data into a data warehouse’s staging tables and then performing the ELT with SQL in batches until the data are fully cooked and ready to take part in the “main course” queries. 

One of the strengths of the SQLstream approach is that for the first time you can use standards-based SQL for performing these ELT steps, but as Continuous ETL rather than operating upon the data after first storing it. We call this “analyze-before-store” approach Query the Future, because the scope of the continuous queries is from the moment they start until the end of future time (in contrast with historical queries, whose scope is from the moment they start until as far back in time as the data are stored). SQLstream’s queries continuously process, clean, aggregate and enhance the data in a highly parallelized, pipelined dataflow process. The staging is in main memory using 64-bit architecture and multiple cores and servers. This provides a highly scalable, efficient and cost-effective solution to ETL, with the virtuous side-effect of enabling the data warehouse to be kept continuously up-to-date by feeding it a stream of fully cooked data and updating its aggregate tables continuously in near real-time. All of this is done without stealing valuable cycles from the data warehouse server.

Data location

The second theme is data location.  Like houses, location is very important when it comes to assessing the value (or usefulness) of the data.  Location might be spatial or temporal.  If you wish to be alerted of a special price for gas at a specific gas station, clearly it is of greater value if you are currently in the immediate vicinity of the gas station.  This shows the value of both the location in space and the location in time.  In contrast, most data warehouses dumbly store all service data and records without regard to their value. 

Clearly, the value of the data in many cases greatly diminishes over time.  Many of the queries that a business might pose are better targeted at current data.  That is particularly true of targeted advertisements, but also when monitoring customer service level, cloud computing infrastructure and the like.  The data are much more valuable when the business is able to take proactive initiative to capitalize on the value – fixing problems or issues before they negatively impact customers, or making that promotion or sale before the customer purchases product or service from a competitor.  SQLstream’s continuous queries are all about focusing analytics where they have the most value by specifying explicit windows of focus for the queries in terms of time, quantity or space.  While many rows can flow into and out of the window of focus for any given query, the window represents the immediate focus of attention.
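The idea of a window of focus can be sketched in a few lines of Python. This is an illustration only, not how SQLstream implements windows: the class name `TimeWindow` and the choice of a running average are assumptions for the example. Rows enter the window as they arrive, and rows older than the window's span drop out, so the answer always reflects the most recent stretch of time.

```python
from collections import deque


class TimeWindow:
    """Sliding time window that reports the count and average of recent values.

    The window covers the interval (latest_ts - span_seconds, latest_ts].
    Timestamps are assumed to arrive in non-decreasing order.
    """

    def __init__(self, span_seconds):
        self.span = span_seconds
        self._rows = deque()  # (timestamp, value) pairs currently in focus
        self._total = 0.0     # sum of the values currently in focus

    def add(self, ts, value):
        """Add a row, evict rows that fell out of focus, return (count, average)."""
        self._rows.append((ts, value))
        self._total += value
        while self._rows[0][0] <= ts - self.span:
            _, old_value = self._rows.popleft()
            self._total -= old_value
        return len(self._rows), self._total / len(self._rows)


if __name__ == "__main__":
    window = TimeWindow(span_seconds=10)
    for ts, value in [(0, 1.0), (5, 3.0), (10, 5.0), (30, 7.0)]:
        print(ts, window.add(ts, value))
```

A window defined by quantity instead of time would evict on row count rather than on timestamp; the shape of the code stays the same.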

Pace of change

The third theme is the pace of change of data. If you have a large quantity of data that is not changing very much, then historical queries and analysis will no doubt provide you with all of your answers. However, if the data are changing constantly, or a lot of new data are arriving constantly, or if you have a focus on a specific window of time or space, then historical analysis has little value. What you care about is the derivative of the change – the rates of change. For example, are our sales accelerating or decelerating? Is the rate of acceleration unusually high or low? What about service outages and error rates? Or customer complaints? The SQLstream approach enables you to see what is changing rather than what is staying the same. It is analogous to predator vision: the predators want to see what is moving and their vision system prioritizes that over what remains motionless. SQLstream provides such dynamic vision.
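To make “rate of change” concrete, here is a minimal Python sketch that turns a stream of (timestamp, value) samples into a stream of rates and accelerations using simple finite differences. The function name `rates_of_change` and the sample figures are made up for the illustration, and it assumes timestamps are strictly increasing.

```python
def rates_of_change(samples):
    """Yield (timestamp, rate, acceleration) for every sample after the first.

    samples: iterable of (timestamp, value) with strictly increasing timestamps.
    rate is the change in value per unit time since the previous sample.
    acceleration is the change in rate per unit time, or None until two
    rates are available.
    """
    prev_ts = prev_value = prev_rate = None
    for ts, value in samples:
        if prev_ts is not None:
            dt = ts - prev_ts
            rate = (value - prev_value) / dt
            accel = None if prev_rate is None else (rate - prev_rate) / dt
            yield ts, rate, accel
            prev_rate = rate
        prev_ts, prev_value = ts, value


if __name__ == "__main__":
    # Hypothetical cumulative sales figures, one sample per hour.
    cumulative_sales = [(0, 100), (1, 110), (2, 130), (3, 140)]
    for ts, rate, accel in rates_of_change(cumulative_sales):
        print(ts, rate, accel)
```

In this sample, sales speed up between hours 1 and 2 (positive acceleration) and slow down between hours 2 and 3 (negative acceleration), which is exactly the kind of signal a purely historical total would hide.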

Balancing historical and continuous analysis

The fourth theme is the need to complement data mining and the results of historical analysis with continuous analysis. Data warehousing allows you to find patterns and predictors from past data and to back test all of your hypotheses over extended periods of time. The back testing of such hypotheses often takes the form of SQL queries that search for patterns of changes of data over time and check that the predicted results occurred and with what frequency. Once you have mined and captured such valuable predictors, it is straightforward to take the SQL you have generated and tweak it to be used in real-time, continuously executed against live data. Using this approach, SQLstream allows you to leverage your data mining results to perform real-time predictive analytics, giving your business a real-time heads up for key indicators of buying signals, or system failure, or what ad should be served up based on a customer’s web behavior.
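The same principle can be shown outside SQL with a short Python sketch: write the pattern once, back test it over stored history, then run the identical rule over a live feed. Everything here is hypothetical, including the function `matches` and the toy predictor `three_rising`; it only illustrates the reuse of one rule in both settings.

```python
from collections import deque


def matches(rows, rule, size):
    """Yield the last row of every run of `size` consecutive rows satisfying `rule`.

    `rows` can be any iterable: a list of historical data for back testing,
    or a generator that produces live rows as they arrive.
    """
    window = deque(maxlen=size)  # keeps only the most recent `size` rows
    for row in rows:
        window.append(row)
        if len(window) == size and rule(tuple(window)):
            yield row


def three_rising(window):
    """Hypothetical predictor: values strictly increase across the window."""
    return all(a < b for a, b in zip(window, window[1:]))


if __name__ == "__main__":
    history = [1, 2, 3, 2, 3, 4, 5]

    # Back test: run the rule over stored history in one pass.
    print(list(matches(history, three_rising, size=3)))

    # Continuous use: the same rule consumes rows one at a time as they arrive.
    live_feed = (value for value in history)  # stand-in for a live source
    for hit in matches(live_feed, three_rising, size=3):
        print("signal at", hit)
```

Because the rule never looks beyond its window, it behaves identically whether the rows come from storage or from a feed that has not finished arriving.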

Brain over brawn processing

My fifth and final theme is “smart declarative” versus “dumb brute force” when applied to data queries.  The latter is how I see Hadoop-based approaches.  You parallelize a problem to take advantage of a lot of available servers and related CPU cycles, but you do not rely on any intelligence on how you partition the problem.  In fact not having to “think” is one of the primary appeals of the technique.  It is a brute force method of brawn over brain.  However, where the problem space is truly huge, or the time or financial budget is more limited, there is always the attraction of the “brain over brawn” technique.  Declarative SQL processing draws upon the mathematical tractability of analyzing patterns and dependencies within the data, the use of keys and indexing, the rewriting of complex formulae into simpler ones and avoiding recalculation of intermediate results – in order to provide a faster, more efficient and smarter way of finding the solutions.  Such declarative techniques can still take extensive advantage of parallelism and inexpensive or available servers and CPU cycles, but they rely on smart analysis in order to optimize the calculations.  SQLstream, and all SQL-based data warehouses, heavily draw upon these mathematical SQL properties and patterns and analysis of the data to do the smart thing when it comes to query processing. 

Stream Computing of the kind embodied by SQLstream, however, has even greater potential to take advantage of parallelism than SQL data warehouses, because SQLstream’s Stream Computing has no transactional bottleneck and is purely declarative. Input streams are not “side-effected” by the execution of stream SQL statements; rather, new streams are created from the original ones (which are left untouched and can be presented concurrently to other SQLstream servers). The execution paradigm is one of parallel dataflow execution – a paradigm that lends itself not only to massively parallel execution but also to massively distributed execution. I believe that as Hadoop becomes more widely understood and deployed, people will begin to see just how much of a better job could be performed by adding a little intelligence and just how powerful declarative stream computing can be.