Blog


CEO
April 21, 2013

SQLstream sponsored the recent IE Group Big Data Innovation Summit in San Francisco, where I also presented on streaming SQL for Hadoop and on extending Hadoop for real-time operational intelligence. As Big Data technologies and Hadoop push further into mainstream enterprises, the need for real-time business operations is an important parallel trend. Some have treated ‘real-time’ and ‘Hadoop’ as synonymous, and are then surprised when Hadoop turns out to be less real-time than they hoped. It should not come as a surprise: Hadoop has many strengths, but it was never intended for low latency, real-time analytics over high velocity data.


Click to View Damian’s Presentation on Slideshare

Real-time Big Data or real-time Big Data?

Which raises the question, what do we mean by real-time? Many products have emerged that claim ‘real-time’ analytics over Hadoop. Yet Hadoop remains a batch processing framework, and it struggles to deliver low latency analytics against high velocity streaming data because of the same limitations as existing RDBMS-based data management platforms. These ‘real-time’ products may generate rapid results over the stored data, but they ignore the latency introduced by data collection and storage, and the resource load of repeatedly executing queries to process newly arriving data. The latency issue may not be apparent for slower data streams such as Twitter feeds, but at the data rates of machine data in telecommunications, industrial automation, M2M and large scale security intelligence, the problem rapidly becomes extreme.

SQLstream’s core stream computing platform, s-Server, processes high velocity data as soon as they are generated, executing continuous SQL queries and analytics directly over log files, sensor feeds and any other machine-generated data source. We measure real-time from the time of data creation, completely eliminating the latency introduced by collecting and storing the data and by repeatedly updating results.
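To make the contrast concrete, here is a minimal Python sketch of the two styles. It is an illustration only, not s-Server code: the `RunningAverage` class, the `batch_average` function and the sensor readings are all hypothetical. The continuous version does a constant amount of work per arriving record, while the store-then-query version rescans everything stored so far each time a fresh answer is needed.

```python
from dataclasses import dataclass


@dataclass
class RunningAverage:
    """Continuously maintained average: constant work per arriving record."""
    count: int = 0
    total: float = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


def batch_average(stored):
    """Store-then-query style: rescans every stored record on each call."""
    return sum(stored) / len(stored)


readings = [12.0, 15.0, 9.0, 20.0]  # hypothetical sensor readings
running = RunningAverage()
stored = []
for value in readings:
    stored.append(value)                        # collect and store first ...
    batch_result = batch_average(stored)        # ... then re-run the query
    continuous_result = running.update(value)   # versus updating in flight
```

Both approaches arrive at the same answer; the difference is how much work and waiting each new record costs.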

Drive real-time actions with streaming operational intelligence

We discussed in a previous blog how real-time operational intelligence eliminates the chasm between business operations and analytics. Operational intelligence is about more than the collection and analysis of log file and machine-generated data. One of the advantages of stream computing is the ease with which predictive analytics can be applied over multiple data streams. This makes it possible to alert on time and space-based patterns of machine, user and consumer behavior that are predictors of some future event – a security breach, network failure or service fault.


True operational intelligence platforms need to go one step further. A real-time platform must do more than visualize results on a dashboard; it is essential to connect back to application and operational systems, and to drive automated updates. Security breaches can be avoided, network resilience mechanisms activated, and service faults corrected before SLA breaches occur and before customers are aware of the problem.

Real-time operational intelligence on Hadoop

So what does this mean for Hadoop? Streaming is not a new technology, but approaches to streaming technology have focused on single-source problems, and have been deployed as standalone platforms for low velocity use cases. With SQLstream, standard SQL queries, albeit continuously executing ones, join, group, partition and analyze real-time machine data streams. There is a further difference: SQLstream’s s-Server streaming SQL platform can also be deployed as a streaming SQL query extension for Hadoop.

A number of streaming Hadoop scenarios are supported:

  • Stream persistence – Hadoop HBase as an active archive for streaming data and derived intelligence using the Flume API. SQLstream also performs continuous aggregation  to support high velocity streams without data loss.
  • Stream replay – restream the complete history of persisted streams from HBase for ‘fast forwarding’ of time-based and spatial analytics. Various interfaces can be utilized, including Cloudera’s Impala.
  • Streaming data queries, joining streaming real-time data with historical streams and intelligence persisted in HBase.
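The third scenario, joining live data with persisted history, can be sketched in a few lines of Python. This is a hedged illustration rather than SQLstream or HBase code: the dictionary `HISTORICAL_BASELINE` merely stands in for intelligence persisted in HBase, and the device names, field names and `join_with_history` function are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator

# Stand-in for intelligence persisted in HBase: a historical baseline per device.
HISTORICAL_BASELINE: Dict[str, float] = {"cell-a": 40.0, "cell-b": 55.0}


def join_with_history(events: Iterable[dict],
                      baseline: Dict[str, float]) -> Iterator[dict]:
    """Enrich each live event with its historical baseline as it arrives."""
    for event in events:
        expected = baseline.get(event["device"])
        if expected is None:
            continue  # inner-join semantics: no history, no output row
        yield {**event,
               "baseline": expected,
               "deviation": event["load"] - expected}


live = [
    {"device": "cell-a", "load": 70.0},
    {"device": "cell-x", "load": 10.0},  # unknown device, dropped by the join
    {"device": "cell-b", "load": 50.0},
]
enriched = list(join_with_history(live, HISTORICAL_BASELINE))
```

Each output row carries both the live reading and the stored context, so a downstream alert can compare the two without a separate query against the archive.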

Making the Elephant fly

Accelerating Hadoop to process live, high velocity unstructured data streams delivers the low latency, streaming operational intelligence demanded by today’s real-time businesses. Hadoop has been the driving force behind Big Data Analytics but as the technology hits the mainstream, many industries are seeking to take a step further and eliminate latency from their business completely. With the SQL language emerging as the key enabler for the mainstream adoption of Hadoop, executing streaming SQL queries over Hadoop extends the platform out to the edge of the network, making it possible to query unstructured log file, sensor and network machine data sources on the fly and in real-time.

Posted under Big Data · Events

CEO
March 20, 2013
  • SQLstream s-Streaming Big Data Engine Benchmarks at 1.35 Million Streaming Events Per Second per 4-core server  – Outperforming Twitter’s Storm Stream Computation Project with Significant Overall TCO Advantage

New York, NY | March 20, 2013 – SQLstream Inc., the Streaming Big Data Company, announced today at GigaOM’s Structure:Data the results of an independent performance benchmark, which measured the SQLstream s-Server 3.0 Big Data Engine processing 1.35 million 1Kbyte records per second per 4-core commodity server, outperforming a comparable configuration based on the Twitter Storm distributed real-time computation system. SQLstream’s s-Server outperformed the Storm-based solution by a factor of 15.

SQLstream’s s-Streaming Big Data Engine delivers action-oriented analytics, extracting operational intelligence in real-time from high velocity, unstructured log file, sensor and other machine-generated data. Streaming intelligence can be persisted, queried and replayed in Hadoop, with additional connectors to all major storage platforms and data warehouses.

The streaming Big Data benchmark was conducted by a large enterprise with a roadmap to stream unstructured operational data from multiple remote log and machine data flows at up to 10 million records per second for each installation. The benchmark requirement was to perform advanced time-series analytics over mobile network infrastructure records in order to predict potential service-impact problems. The benchmark projects that the s-Server platform would require just eight servers to scale up to 10 million records per second, compared with an estimate of more than 110 servers for the comparable Storm approach.
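The server counts can be reproduced with simple arithmetic. The sketch below uses only figures quoted in this release (10 million records per second, 1.35 million records per second per server, and the reported factor of 15) and assumes, as a simplification, that throughput scales linearly with the number of servers.

```python
import math

TARGET_RATE = 10_000_000          # records per second per installation
SQLSTREAM_PER_SERVER = 1_350_000  # records per second per 4-core server
SPEEDUP_OVER_STORM = 15           # reported performance factor

# Assumption: throughput scales linearly with server count.
sqlstream_servers = math.ceil(TARGET_RATE / SQLSTREAM_PER_SERVER)

# Implied per-server rate for the comparable Storm configuration.
storm_per_server = SQLSTREAM_PER_SERVER / SPEEDUP_OVER_STORM
storm_servers = math.ceil(TARGET_RATE / storm_per_server)

print(sqlstream_servers, storm_per_server, storm_servers)
```

Under these assumptions the calculation gives 8 servers for s-Server and 112 for Storm, in line with the "more than 110" estimate.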

SQLstream s-Server 3.0 was able to demonstrate significant cost savings with dramatically lower TCO. The TCO savings came from a combination of reduced hardware and power consumption, the power and simplicity of SQL over low-level Java development, plus reduced maintenance requirements. Other factors influencing SQLstream s-Server’s TCO advantage came from its integrated Big Data platform architecture, ability to update on the fly as new data flows are incorporated, significantly faster implementation timescales using SQL for streaming analytics and integration, and automatic platform optimization for turbo-charged performance and parallel dataflow execution.

“SQLstream excels through the combination of its mature, industry-strength streaming Big Data platform, our support for standard SQL (SQL:2008) for streaming analysis and integration, plus a flexible adapter and agent architecture,” said SQLstream CEO Damian Black. “SQLstream s-Server is today’s clear streaming performance winner – with blazingly fast throughput, an ability to handle a wide variety of message types, sources and formats, and an efficient Streaming Data Protocol with compact optimized binary data formats.”

Advantages of SQLstream’s s-Server, the core element of the company’s s-Streaming Big Data Engine, as demonstrated in the performance benchmark project include:

  • Scaling to a throughput of 1.35 million 1Kbyte records per second per four-core server each fed by twenty remote streaming agents.
  • Expressiveness of the standards-based streaming SQL language with support for enhanced streaming User Defined Functions and User Defined Extensions (UDF/UDX).
  • Deploying new processing analytics pipelines on the fly without having to stop and recompile or rebuild applications.
  • Advanced pipeline operations including data enrichment, sliding time windows, external data storage platform read and write, and other advanced time-series analytics, all based on existing SQL standards.
  • Advanced memory management, with query optimization and execution environments to utilize and recover memory efficiently.
  • Higher throughput and performance per server for lower hardware requirements, lower costs and simple to maintain installations.
  • Proven and mature enterprise-grade product with a validated roadmap and controlled release schedule.

All required modules used in the benchmark were integrated with s-Server 3.0, using 20 remote streaming agents connected per SQLstream s-Server instance, each instance running on a four-core Intel® Xeon® server platform with Red Hat Enterprise Linux.

 

About SQLstream

SQLstream (www.sqlstream.com) is the pioneer and innovator of a patented Streaming Big Data Engine that unlocks the real-time value of high-velocity unstructured machine data. SQLstream’s s-Streaming products put “Big Data on Tap™”, enabling businesses to harness action-oriented and predictive analytics, with on the fly visualization and streaming operational intelligence from their log file, sensor, network and device data. SQLstream’s core V5 streaming technology is a massively scalable, distributed platform for analyzing unstructured Big Data streams using standards-based SQL, with support for streaming SQL query execution over Hadoop/HBase, Oracle, IBM, and other enterprise database, data warehouse and data management systems. SQLstream’s headquarters are in San Francisco, CA.

Posted under Events · Press Releases

CEO
July 1, 2010

Last week I was on a panel for “Big Data” at Structure2010, a GigaOM event. As usual, it was very well run, and there was a large throng of Silicon Valley luminaries, ranging from entrepreneurs to venture capitalists, scattered in with some large customers and users of technology. We have clearly moved on a long way from the days when I was told to change my slides and replace the cloud graphic with a box because “clouds are cloudy” (a direct quotation from a tier one venture capitalist, whose identity I will protect to spare him embarrassment).

SQLstream is already the market leader in applying stream computing to Intelligent Transportation Systems, and we also have the opportunity to provide a similar impact to the Cloud Computing Service Monitoring space. It seems we have exactly the perfect solution to provide real-time insights into service usage, bottlenecks, error rates and service level compliance. And you can add regulatory compliance to that list too – from the continuous alerting side to complement the excellent historical solutions that are out there.

From the presentations at the show, it is clear that Cloud Computing has truly come of age. SQLstream uses cloud services for all demonstrations and also in our QA and Engineering processes. We also have customers deploying in the cloud. The latest emerging cloud solutions fill in many of the former technology gaps, allowing seamless integration into or transition from traditional data centers. You can even run your own private clouds leveraging the same APIs available on the public clouds.

On the Big Data front, on the panel alongside SQLstream were a Hadoop vendor and a high-performance column store data warehouse vendor. The other two panelists were users of “big data” technologies. It was interesting to discover that we already had two implementations where SQLstream operates in concert with or in parallel with the other two panelist vendors’ technologies.

There is even a customer (Mozilla) that uses all three technology approaches for download analytics: Hadoop in the form of HBase and a column store data warehouse for historical SQL queries over downloads, and SQLstream to generate high-performance continuous real-time analytics and reporting on download statistics for all versions of Firefox. This clearly demonstrates that there is a role for each of the Big Data technologies highlighted on the panel, and an interesting and growing market opportunity. It also indicates some clear partnership opportunities.

I look forward to seeing the developments in our space and in cloud computing over the coming year and hope to be invited back again soon. We were originally present on the Big Data panel at GigaOM’s inaugural Structure2008 event, so I guess we should be set for a reappearance at Structure2012?! If so, I am sure we will have some exciting new stories to share.

Here is a link to the video recording of the panel session. A big thank-you to Phil Hendrix for his excellent moderation of the panel and the professional preparation work he did beforehand so that the actual event went smoothly.

Posted under Big Data

CEO
June 22, 2010

There is a lot of buzz these days about the challenge of “Big Data”. I’ll be speaking on the subject at GigaOM’s Structure2010, on the “DEALING WITH THE DATA TSUNAMI: THE BIG DATA” panel. There are many dimensions to the challenges posed by “Big Data”, which I’ve presented here as five separate but related themes.

Speed of data arrival

The first theme is speed.  When a lot of data arrive fast, it is often overlooked that they arrive in raw form and need to be processed or cooked before they can be of any real value. The processing normally comprises cleaning, filtering, aggregating and validating.  Sometimes the data need to be enhanced, normalized or de-normalized.  While there are a number of proprietary ETL tools out there that can help, most people prefer to perform these operations using SQL.  This approach has become known as ELT as the data are Extracted, Loaded and then Transformed (as opposed to Transformed then Loaded).  In the past, this has meant loading raw data into a data warehouse’s staging tables and then performing the ELT with SQL in batches until the data are fully cooked and ready to take part in the “main course” queries.

One of the strengths of the SQLstream approach is that, for the first time, you can use standards-based SQL to perform these ELT steps as Continuous ETL, rather than operating upon the data after first storing them. We call this “analyze-before-store” approach Query the Future, because the scope of the continuous queries is from the moment they start until the end of future time (in contrast with historical queries, whose scope is from the moment they start back to as far as the data are stored). SQLstream’s queries continuously process, clean, aggregate and enhance the data in a highly parallelized, pipelined dataflow process. The staging is in main memory, using 64-bit architecture and multiple cores and servers. This provides a highly scalable, efficient and cost-effective solution to ETL, with the virtuous side effect of enabling the data warehouse to be kept continuously up to date by feeding it a stream of fully cooked data and updating its aggregate tables continuously in near real-time. All of this is done without stealing valuable cycles from the data warehouse server.
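As a rough illustration of the analyze-before-store idea, the Python sketch below chains cleaning, filtering and aggregation stages so that each record is fully cooked in memory before anything is written. It is not SQLstream code: the `host,status,bytes` record layout, the stage names and the `warehouse` dictionary (standing in for an aggregate table) are assumptions made for the example.

```python
from collections import defaultdict


def clean(raw_lines):
    """Parse 'host,status,bytes' lines, dropping malformed records."""
    for line in raw_lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue
        host, status, size = parts
        if not (status.isdigit() and size.isdigit()):
            continue
        yield host.lower(), int(status), int(size)


def only_successful(records):
    """Filter: keep only requests that returned HTTP 200."""
    for host, status, size in records:
        if status == 200:
            yield host, size


def running_totals(records):
    """Aggregate: emit the updated byte total for a host on every record."""
    totals = defaultdict(int)
    for host, size in records:
        totals[host] += size
        yield host, totals[host]


raw = ["web1,200,512", "WEB1 ,200, 488", "garbage", "web2,500,100", "web2,200,64"]
warehouse = {}  # stand-in for an aggregate table in the data warehouse
for host, total in running_totals(only_successful(clean(raw))):
    warehouse[host] = total  # only cooked, aggregated rows reach storage
```

Because the stages are generators, each record flows through the whole pipeline as it arrives; nothing waits in a staging table for a batch job.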

Data location

The second theme is data location. As with houses, location is very important when it comes to assessing the value (or usefulness) of the data. Location might be spatial or temporal. If you wish to be alerted to a special price for gas at a specific gas station, the alert is clearly of greater value if you are currently in the immediate vicinity of that gas station. This shows the value of both the location in space and the location in time. In contrast, most data warehouses dumbly store all service data and records without regard to their value.

Clearly, the value of the data in many cases greatly diminishes over time. Many of the queries that a business might pose are better targeted at current data. That is particularly true of targeted advertisements, but also of monitoring customer service levels, cloud computing infrastructure and the like. The data are much more valuable when the business is able to take the initiative and capitalize on that value: fixing problems or issues before they negatively impact customers, or making that promotion or sale before the customer purchases a product or service from a competitor. SQLstream’s continuous queries are all about focusing analytics where they have the most value by specifying explicit windows of focus for the queries in terms of time, quantity or space. While many rows can flow into and out of the window of focus for any given query, the window represents the immediate focus of attention.
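A window of focus in time can be sketched as follows. This is a simplified Python illustration, not SQLstream's implementation: the `SlidingTimeWindow` class is hypothetical, and it assumes that rows arrive in timestamp order. Rows flow into the window as they arrive and are evicted once they are older than the window width.

```python
from collections import deque


class SlidingTimeWindow:
    """Running sum over the most recent `width_seconds` of a stream.

    Assumes rows arrive in non-decreasing timestamp order.
    """

    def __init__(self, width_seconds):
        self.width = width_seconds
        self.rows = deque()
        self.total = 0.0

    def add(self, timestamp, value):
        self.rows.append((timestamp, value))
        self.total += value
        # Evict rows that have flowed out of the window of focus.
        while self.rows and self.rows[0][0] <= timestamp - self.width:
            _, old_value = self.rows.popleft()
            self.total -= old_value
        return self.total


window = SlidingTimeWindow(width_seconds=60)
events = [(0, 5), (30, 3), (61, 4), (100, 1)]  # (timestamp, value) pairs
totals = [window.add(ts, value) for ts, value in events]
```

A window expressed in quantity rather than time would evict on row count instead, for example with `deque(maxlen=n)`.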

Pace of change

The third theme is the pace of change of data. If you have a large quantity of data that is not changing very much, then historical queries and analysis will no doubt provide you with all of your answers. However, if the data are changing constantly, or a lot of new data are arriving constantly, or if you have a focus on a specific window of time or space, then historical analysis has little value. What you care about is the derivative of the change, that is, the rates of change. For example, are our sales accelerating or decelerating? Is the rate of acceleration unusually high or low? What about service outages and error rates? Or customer complaints? The SQLstream approach enables you to see what is changing rather than what is staying the same. It is analogous to predator vision: the predators want to see what is moving, and their vision system prioritizes that over what remains motionless. SQLstream provides such dynamic vision.
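The sales question can be made concrete with a small Python sketch. The figures are invented for illustration and the `deltas` function is a hypothetical helper: applied once to a stream of per-hour sales it gives the rate of change, and applied twice it gives the acceleration.

```python
def deltas(stream):
    """Yield the change between consecutive values of a stream."""
    previous = None
    for value in stream:
        if previous is not None:
            yield value - previous
        previous = value


sales_per_hour = [100, 110, 125, 145, 150]  # hypothetical figures

rate = list(deltas(sales_per_hour))                  # change per hour
acceleration = list(deltas(deltas(sales_per_hour)))  # change of the change
```

Here `rate` is `[10, 15, 20, 5]` and `acceleration` is `[5, 5, -15]`: sales are still growing in the final hour, but the sharp negative acceleration is the signal worth alerting on.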

Balancing historical and continuous analysis

The fourth theme is the need to complement data mining and the results of historical analysis with continuous analysis. Data warehousing allows you to find patterns and predictors from past data and to back test all of your hypotheses over extended periods of time. The back testing of such hypotheses often takes the form of SQL queries that search for patterns of changes of data over time and check that the predicted results occurred and with what frequency. Once you have mined and captured such valuable predictors, it is straightforward to take the SQL you have generated and tweak it to be used in real-time, continuously executed against live data. Using this approach, SQLstream allows you to leverage your data mining results to perform real-time predictive analytics, giving your business a real-time heads up for key indicators of buying signals or system failure, or for deciding what ad should be served up based on a customer’s web behavior.
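The reuse of a back-tested predictor on live data can be sketched in Python. Everything here is hypothetical: the `error_spike` rule (an error rate that at least doubles between intervals), the sample figures and the `live_feed` generator are assumptions for illustration. The point is that the same logic runs unchanged over a stored history and over a live stream.

```python
def error_spike(previous_rate, current_rate, threshold=2.0):
    """Hypothetical predictor: error rate at least doubles between intervals."""
    return previous_rate > 0 and current_rate / previous_rate >= threshold


def matches(rates):
    """Yield the position of every interval where the predictor fires."""
    previous = None
    for index, current in enumerate(rates):
        if previous is not None and error_spike(previous, current):
            yield index
        previous = current


# Back test: run the predictor over stored history.
historical = [1.0, 1.1, 2.5, 2.6, 6.0]
backtest_hits = list(matches(historical))


# Live use: the same function consumes a stream as it arrives.
def live_feed():
    yield from (0.5, 0.6, 1.5)


live_hits = list(matches(live_feed()))
```

In this example the predictor fires at positions 2 and 4 of the history and at position 2 of the live feed.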

Brain over brawn processing

My fifth and final theme is “smart declarative” versus “dumb brute force” when applied to data queries.  The latter is how I see Hadoop-based approaches.  You parallelize a problem to take advantage of a lot of available servers and related CPU cycles, but you do not rely on any intelligence on how you partition the problem.  In fact not having to “think” is one of the primary appeals of the technique.  It is a brute force method of brawn over brain.  However, where the problem space is truly huge, or the time or financial budget is more limited, there is always the attraction of the “brain over brawn” technique.  Declarative SQL processing draws upon the mathematical tractability of analyzing patterns and dependencies within the data, the use of keys and indexing, the rewriting of complex formulae into simpler ones and avoiding recalculation of intermediate results – in order to provide a faster, more efficient and smarter way of finding the solutions.  Such declarative techniques can still take extensive advantage of parallelism and inexpensive or available servers and CPU cycles, but they rely on smart analysis in order to optimize the calculations.  SQLstream, and all SQL-based data warehouses, heavily draw upon these mathematical SQL properties and patterns and analysis of the data to do the smart thing when it comes to query processing.

Stream Computing of the kind embodied by SQLstream, however, has even greater potential to take advantage of parallelism than SQL data warehouses, because SQLstream’s Stream Computing has no transactional bottleneck and is purely declarative. Input streams are not “side-effected” by the execution of stream SQL statements; rather, new streams are created from the original ones (which are left untouched and can be presented concurrently to other SQLstream servers). The execution paradigm is one of parallel dataflow execution, a paradigm that lends itself not only to massively parallel execution but also to massively distributed execution. I believe that as Hadoop becomes more widely understood and deployed, people will begin to see just how much of a better job could be performed by adding a little intelligence, and just how powerful declarative stream computing can be.

Posted under Big Data