Blog


CEO
April 21, 2013

SQLstream sponsored the recent IE Group Big Data Innovation Summit in San Francisco, where I also presented on streaming SQL for Hadoop and on extending Hadoop for real-time operational intelligence. As Big Data technologies and Hadoop push further into mainstream enterprises, the need for real-time business operations is an important parallel trend. ‘Real-time’ and ‘Hadoop’ have been considered synonymous by some, yet people are surprised when Hadoop does not seem to be as real-time as they hoped. This should not come as a surprise: Hadoop has many strengths, but it was never intended for low latency, real-time analytics over high velocity data.


Click to View Damian’s Presentation on Slideshare

Real-time Big Data or real-time Big Data?

Which raises the question, what do we mean by real-time? Many products have emerged that claim ‘real-time’ analytics over Hadoop. Yet Hadoop remains a batch processing framework, and struggles to deliver low latency analytics against high velocity streaming data because of the same limitations as existing RDBMS-based data management platforms. These ‘real-time’ products may generate rapid results over the stored data, but they ignore the latency introduced by data collection and storage, as well as the resource load of repeatedly executing queries to process newly arriving data. The latency issue may not be apparent for slower data streams, such as Twitter feeds, but at the data rates of machine data in telecommunications, industrial automation, M2M and large scale security intelligence, the problem rapidly becomes extreme.

SQLstream’s core stream computing platform, s-Server, processes high velocity data as soon as it is generated, executing continuous SQL queries and analytics directly over log files, sensor feeds and any other machine-generated data source. We measure real-time from the time of data creation, completely eliminating the latency introduced by collecting and storing data and by repeatedly updating results.
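To make the difference between a continuous query and a repeatedly executed stored-data query concrete, here is a minimal Python sketch. It is not SQLstream code: the `RunningAverage` class, the `batch_average` function and the sensor records are all hypothetical, chosen only to show that an incrementally maintained result can match a full re-query while doing constant work per arriving record.

```python
from collections import defaultdict


class RunningAverage:
    """Per-key average that is updated once for each arriving record."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def update(self, key, value):
        # Constant work per record: no stored history is rescanned.
        self._count[key] += 1
        self._total[key] += value
        return self._total[key] / self._count[key]


def batch_average(stored, key):
    """Store-then-query approach: rescan everything stored so far."""
    values = [v for k, v in stored if k == key]
    return sum(values) / len(values)


# Hypothetical sensor readings arriving one at a time.
records = [("sensor-a", 10.0), ("sensor-b", 4.0),
           ("sensor-a", 20.0), ("sensor-a", 30.0)]

running = RunningAverage()
stored = []
for key, value in records:
    stored.append((key, value))
    streaming_result = running.update(key, value)
    # Same answer, but the batch version rescans the store every time.
    assert streaming_result == batch_average(stored, key)
```

The batch version does more work as the store grows, which is the repeated-execution load described above; the running version only ever touches the new record.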

Drive real-time actions with streaming operational intelligence

We discussed in a previous blog how real-time operational intelligence eliminates the chasm between business operations and analytics. Operational intelligence is about more than the collection and analysis of log file and machine-generated data. One of the advantages of stream computing is the ease with which predictive analytics can be applied over multiple data streams. This makes it possible to alert on time and space-based patterns of machine, user and consumer behavior that are predictors of some future event – a security breach, network failure or service fault.


True operational intelligence platforms need to go one step further. A real-time platform must do more than visualize results on a dashboard: it is essential to connect back to application and operational systems, and to drive automated updates. Security breaches can be avoided, network resilience mechanisms activated, and service faults corrected before SLA breaches occur and customers become aware of the problem.

Real-time operational intelligence on Hadoop

So what does this mean for Hadoop? Streaming is not a new technology, but earlier approaches to streaming have focussed on single source problems, and have been deployed as standalone platforms for low velocity use cases. With SQLstream, standard SQL queries, albeit continuously executing ones, join, group, partition and analyze real-time machine data streams. There is a further difference – SQLstream’s s-Server streaming SQL platform can also be deployed as a streaming SQL query extension for Hadoop.

A number of streaming Hadoop scenarios are supported:

  • Stream persistence – Hadoop HBase as an active archive for streaming data and derived intelligence using the Flume API. SQLstream also performs continuous aggregation to support high velocity streams without data loss.
  • Stream replay – restream the complete history of persisted streams from HBase for ‘fast forwarding’ of time-based and spatial analytics. Various interfaces can be utilized, including Cloudera’s Impala.
  • Streaming data queries, joining streaming real-time data with historical streams and intelligence persisted in HBase.
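
The third scenario, joining live streams with history persisted in HBase, can be sketched roughly as follows. This is an illustration under stated assumptions rather than SQLstream or HBase code: the `baselines` dictionary stands in for historical intelligence that would be read from HBase, and `join_with_history`, the cell names and the threshold are all hypothetical.

```python
def join_with_history(events, baselines, threshold=2.0):
    """Join each live (device, value) event with its historical baseline.

    Yields (device, value, baseline, is_anomalous) tuples. Devices with no
    history are skipped, which mirrors an inner join.
    """
    for device, value in events:
        baseline = baselines.get(device)
        if baseline is None:
            continue
        yield device, value, baseline, value > threshold * baseline


# Stand-in for historical averages persisted in an archive such as HBase.
baselines = {"cell-1": 100.0, "cell-2": 50.0}

# Stand-in for a live machine data stream.
live = [("cell-1", 120.0), ("cell-2", 130.0), ("cell-3", 10.0)]

results = list(join_with_history(live, baselines))
for row in results:
    print(row)
```

Because the function is a generator, each event is joined as it arrives rather than after the whole stream has been collected.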

Making the Elephant fly

Accelerating Hadoop to process live, high velocity unstructured data streams delivers the low latency, streaming operational intelligence demanded by today’s real-time businesses. Hadoop has been the driving force behind Big Data Analytics but as the technology hits the mainstream, many industries are seeking to take a step further and eliminate latency from their business completely. With the SQL language emerging as the key enabler for the mainstream adoption of Hadoop, executing streaming SQL queries over Hadoop extends the platform out to the edge of the network, making it possible to query unstructured log file, sensor and network machine data sources on the fly and in real-time.

Posted under Big Data · Events

CEO
March 20, 2013
  • SQLstream s-Streaming Big Data Engine Benchmarks at 1.35 Million Streaming Events Per Second per 4-core server – Outperforming Twitter’s Storm Stream Computation Project with Significant Overall TCO Advantage

New York, NY | March 20, 2013 – SQLstream Inc., the Streaming Big Data Company, announced today at GigaOM’s Structure:Data the results of an independent performance benchmark which measured the SQLstream s-Server 3.0 Big Data Engine processing 1.35 million 1Kbyte records per second per 4-core commodity server, outperforming a comparable configuration based on the Twitter Storm distributed real-time computation system. SQLstream’s s-Server outperformed the Storm-based solution by a factor of 15.

SQLstream’s s-Streaming Big Data Engine delivers action-oriented analytics, extracting operational intelligence in real-time from high velocity, unstructured log file, sensor and other machine-generated data. Streaming intelligence can be persisted, queried and replayed in Hadoop, with additional connectors to all major storage platforms and data warehouses.

The streaming Big Data benchmark was conducted by a large enterprise with a roadmap to stream unstructured operational data from multiple remote log and machine data flows at up to 10 million records per second for each installation. The benchmark requirement was to perform advanced time-series analytics over mobile network infrastructure records in order to predict potential service-impact problems. The benchmark projects that the s-Server platform would require just eight servers to scale up to 10 million records per second — versus an estimated more than 110 servers for the comparable Storm approach.
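
The server counts quoted above can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes perfectly linear scaling across servers and applies the quoted 15x factor per server; both are simplifying assumptions for illustration, not details taken from the benchmark report.

```python
import math

TARGET_RATE = 10_000_000           # records per second per installation
SQLSTREAM_PER_SERVER = 1_350_000   # benchmarked records per second per 4-core server
SPEEDUP_OVER_STORM = 15            # quoted performance factor

# Implied Storm throughput per server if the 15x factor holds per server.
storm_per_server = SQLSTREAM_PER_SERVER / SPEEDUP_OVER_STORM

# Round up: a fraction of a server still means one more machine.
sqlstream_servers = math.ceil(TARGET_RATE / SQLSTREAM_PER_SERVER)
storm_servers = math.ceil(TARGET_RATE / storm_per_server)

print(f"Implied Storm rate per server: {storm_per_server:,.0f} records/s")
print(f"s-Server machines needed: {sqlstream_servers}")
print(f"Storm machines needed: {storm_servers}")
```

Under these assumptions the calculation gives 8 servers for s-Server and 112 for Storm, in line with the "eight servers" and "more than 110 servers" figures.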

SQLstream s-Server 3.0 was able to demonstrate significant cost savings with dramatically lower TCO. The TCO savings came from a combination of reduced hardware and power consumption, the power and simplicity of SQL over low-level Java development, plus reduced maintenance requirements. Other factors influencing SQLstream s-Server’s TCO advantage came from its integrated Big Data platform architecture, ability to update on the fly as new data flows are incorporated, significantly faster implementation timescales using SQL for streaming analytics and integration, and automatic platform optimization for turbo-charged performance and parallel dataflow execution.

“SQLstream excels through the combination of its mature, industry-strength streaming Big Data platform, our support for standard SQL (SQL:2008) for streaming analysis and integration, plus a flexible adapter and agent architecture,” said SQLstream CEO Damian Black. “SQLstream s-Server is today’s clear streaming performance winner – with blazingly fast throughput, an ability to handle a wide variety of message types, sources and formats, and an efficient Streaming Data Protocol with compact optimized binary data formats.”

Advantages of SQLstream’s s-Server, the core element of the company’s s-Streaming Big Data Engine, as demonstrated in the performance benchmark project include:

  • Scaling to a throughput of 1.35 million 1Kbyte records per second per four-core server each fed by twenty remote streaming agents.
  • Expressiveness of the standards-based streaming SQL language with support for enhanced streaming User Defined Functions and User Defined Extensions (UDF/UDX).
  • Deploying new processing analytics pipelines on the fly without having to stop and recompile or rebuild applications.
  • Advanced pipeline operations including data enrichment, sliding time windows, external data storage platform read and write, and other advanced time-series analytics, all based on existing SQL standards.
  • Advanced memory management, with query optimization and execution environments to utilize and recover memory efficiently.
  • Higher throughput and performance per server for lower hardware requirements, lower costs and simple to maintain installations.
  • Proven and mature enterprise-grade product with a validated roadmap and controlled release schedule.

All required modules used in the benchmark were integrated with s-Server 3.0, using 20 remote streaming agents connected per SQLstream s-Server instance, each running on a four-core Intel® Xeon® server platform with Red Hat Enterprise Linux.

 

About SQLstream

SQLstream (www.sqlstream.com) is the pioneer and innovator of a patented Streaming Big Data Engine that unlocks the real-time value of high-velocity unstructured machine data. SQLstream’s s-Streaming products put “Big Data on Tap™” – enabling businesses to harness action-oriented and predictive analytics, with on the fly visualization and streaming operational intelligence from their log file, sensor, network and device data. SQLstream’s core V5 streaming technology is a massively scalable, distributed platform for analyzing unstructured Big Data streams using standards-based SQL, with support for streaming SQL query execution over Hadoop/HBase, Oracle, IBM, and other enterprise database, data warehouse and data management systems. SQLstream’s headquarters are in San Francisco, CA.

Posted under Events · Press Releases

Vice President Marketing
January 17, 2013

Bloor Group’s Robin Bloor hosted SQLstream’s CEO Damian Black in The Briefing Room on Tuesday January 8th 2013. The webcast, entitled “Windows of Opportunity: Big Data on Tap”, focussed on the emergence of both SQL and the streaming data platform as key enablers for real-time Big Data solutions in an ever-maturing marketplace. You can watch the full webinar from the link below, but I’m going to focus on some of the topics arising from the online discussion between Robin, Damian and the audience.

It was an interesting discussion, covering Big Data and streaming in the wider context of enterprise deployments, and a number of important points were raised:

Hadoop is a data reservoir, not a real-time platform.

Many believe incorrectly that Hadoop is a platform for real-time low latency analytics. It’s not. Hadoop is a multi-purpose engine but not a real-time, high performance engine. The parallelism of Hadoop is great for processing the data once it’s stored, but it has high latency. However, with the integration of a streaming data platform for continuous data collection, analysis and streaming integration, Hadoop can be used as the active archive for a true real-time, streaming Big Data system.

Operational intelligence needs a Streaming Big Data Platform

The bulk of real-time operational intelligence today is derived from log and machine data, for example data generated by the Internet, Cloud infrastructure and applications. There are many log monitoring tools out there, and while they are very capable, we’re finding that SQLstream’s real-time streaming Big Data platform is being used to solve the high volume, high velocity, complex data problems that log monitoring tools are unable to address at an affordable price point.

The emergence of SQL for Big Data

The first phase of Hadoop and Big Data platforms saw the emergence of NoSQL data storage platforms, looking to overcome the rigidity of normalized RDBMS schemas. However, as the technology hits mainstream industry, the need for simpler, high performance and reliable queries is driving a resurgence in SQL as the de facto language for Big Data processing (see Cloudera Impala for example). What’s less apparent is that SQL is the ideal language for processing data streams using real-time, window-based queries. The issue with normalization and rigid schemas is a non-issue for a streaming data platform – there are no tables, no data gets stored!
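
For readers unfamiliar with window-based queries, the idea can be sketched in a few lines of Python. This is not streaming SQL and not SQLstream’s implementation: `SlidingWindowCount` is a hypothetical class that counts events over the last N seconds, assuming timestamps arrive in order. The point is that only the contents of the current window are held in memory; nothing is written to a table.

```python
from collections import deque


class SlidingWindowCount:
    """Count of events in the interval (now - window_seconds, now]."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp):
        """Add one event and return the count for the current window.

        Assumes timestamps are non-decreasing (in-order arrival).
        """
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Evict events that have slid out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


window = SlidingWindowCount(window_seconds=10)
counts = [window.add(t) for t in [0, 3, 7, 12, 25]]
print(counts)
```

In streaming SQL the same intent would be expressed declaratively with a window clause; the sketch only shows the mechanics a window implies.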

So in summary, streaming Big Data is the emerging technology for 2013. And SQL is the (re-)emerging technology as Big Data hits mainstream industry. Processing real-time log and machine data streams is a key requirement today, but industries with sensor, M2M and telematics applications are catching up fast.

 

Posted under Big Data · Events

Vice President Marketing
January 10, 2013

The SQLstream Briefing Room webinar with Robin Bloor took place on Tuesday January 8th 2013. “Windows of Opportunity: Big Data on Tap” highlighted how the evolving Big Data landscape needs technologies that enable a much bigger enterprise-wide picture, complete with multiple data streams that can be combined to show what’s happening in real-time. The speakers included:

  • Eric Kavanagh, CEO, The Bloor Group, who hosted the event.
  • Robin Bloor, Chief Analyst, The Bloor Group, who led the online briefing.
  • Damian Black, President & CEO, SQLstream, discussing the emergence of streaming Big Data management as a key enabler for Big Data solutions, and how SQLstream is at the forefront of streaming innovation.

This was a very interesting and informative briefing on the emergence of streaming Big Data management and the use cases for real-time Big Data solutions.

Click here to watch the webcast …

Title: Windows of Opportunity: Big Data on Tap
Most business opportunities are moving targets these days, rendering static analytical solutions rather ineffective. Instead, organizations need technologies that enable a much bigger picture, complete with multiple data streams that can be combined to show what’s happening in real-time. And increasingly, companies need to analyze both traditional structured data as well as Big Data, including machine-generated data from all manner of enterprise systems.

Register for this episode of The Briefing Room to hear veteran Analyst Robin Bloor explain how a confluence of market forces has opened the door to a new analytical paradigm, one in which companies can leverage a vast array of data streams to pinpoint windows of opportunity as or even just before they appear. Bloor will be briefed by Damian Black of SQLstream, who will discuss his company’s analytical platform, which enables the management of dynamic information assets in much the way that traditional databases do for stored assets.

Posted under Events

Vice President Marketing
October 11, 2012

Perhaps the highlight of Oracle OpenWorld last week, or at least the one most commented on by attendees at our booth, was Larry Ellison’s demo of Exadata and Exalytics – querying 10 days or so of stored Twitter feeds with the hope of finding the best US athlete from the recent London 2012 Olympics to endorse a car company. This seemed to strike a chord with the audience. How many organizations employ a marketing analytics company to spend a vast amount of time poring over data to work out the top candidates for a marketing campaign? That said, would the CMO really go with a query result, or choose their favorite in any case?

Cloud and business applications were a focus, although as others have blogged elsewhere, despite 80+ acquisitions in the past few years, Oracle remains a database company. Major announcements / news included:

  • Release of Oracle 12c (the ‘c’ for ‘cloud’), and the announcement of its first multi-tenanted and ‘pluggable’ databases got a few ripples of applause from the audience.
  • Exadata X3 box, the in-memory machine with 22 raw TB of memory and a claimed 10X compression, making a total of 220TB of ‘memory’ in a rack. Oracle claims this is 100 times faster than the Exadata Oracle launched in the last few years.

Streaming analytics and Twitter

Back to Larry’s Twitter example. Of course, this can be achieved easily as a streaming application in real-time. Semantic streaming is something SQLstream has been doing for some time, taking unstructured data such as tweets, emails and texts and determining sentiment and aggregated scoring in real-time. Use cases include identifying traffic incidents on the road networks to augment geospatial analysis of vehicle GPS data, and also in telecommunications, to better determine in real-time a customer’s true perception of their quality of experience for delivered services.

The numbers seemed impressive – Larry crunched nearly five billion tweets and 27 billion social media relationships. But breaking this down, is this really a Big Data problem? Five billion tweets, even over a one day period (I believe the demo was 10 days), is only about 58,000 tweets per second. This is well in excess of Twitter’s top peak loads during major events such as the Superbowl, but well within the capability of SQLstream’s real-time streaming Big Data platform, even on an entry level, single server, 2-core machine. Of course, the complete solution architecture may include data storage platforms such as Hadoop or Oracle, where aggregated streaming results can be loaded and persisted in real-time, further crunched in the data warehouse, and historical analysis joined back with the real-time streams to better identify any moving trends.
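
For anyone who wants to check the arithmetic, the rate works out as follows. The five billion figure and the one day and 10 day periods come from the paragraph above; the variable names are just for illustration.

```python
TWEETS = 5_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Worst case: all five billion tweets arrive within a single day.
rate_one_day = TWEETS / SECONDS_PER_DAY

# Closer to the demo: the same volume spread over 10 days.
rate_ten_days = TWEETS / (10 * SECONDS_PER_DAY)

print(f"Over 1 day:   {rate_one_day:,.0f} tweets per second")
print(f"Over 10 days: {rate_ten_days:,.0f} tweets per second")
```

So the one day case is roughly 58,000 tweets per second, and the 10 day case is a tenth of that.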

It was an interesting demo nonetheless, and one that really should be completed in real-time as a streaming problem. SQLstream’s ability to analyze and aggregate streams (in this case across keywords and hashtags), to provide geospatial and clustering analysis, and to deliver raw and aggregated data as continuous streams to the backend storage platforms makes this very achievable today.

On the show floor

Apart from the heavy footfall at the SQLstream booth, perhaps most notable were the increasingly uninventive marketing mechanisms used to persuade unsuspecting attendees to listen to product pitches based on the promise of winning a piece of Apple hardware. Surely marketing managers can think up something a bit more inventive than an iPad? The exception was the speaking robot. I’m not sure if this was an exhibit floor attraction, although I saw it ‘chatting’ to passersby on the Wipro booth.

Contact us if you’d like to find out more about SQLstream and our streaming Big Data management platform.


Vice President Marketing
April 19, 2012

Joining real-time structured and unstructured data feeds for better accuracy and reliability from your operational intelligence, and the Text Analytics Summit, 2012, London.

Three IT trends have emerged over the past year – Big Data, real-time and the importance of unstructured data. Taking the latter first, there is an increasing awareness that much of the data we have available to us today is unstructured (Cloudera is amongst the many claiming that 80% of all data is unstructured). Unstructured data includes text messages, documents, tweets, emails and video content. There’s also a growing industry for tools and software that perform unstructured data analytics – primarily text analytics using semantic modeling, tagging and subsequent analysis.

The past year has also seen Big Data and Hadoop emerge from the rarefied atmosphere of California’s Silicon Valley into mainstream IT.  Driven by statistics such as 90% of all data available today has been generated in the past two years, Big Data as a functional area for primarily unstructured data is here to stay, and is effectively supercomputing lite for the masses.

The need for real-time streaming data management

However, the real-time trend is less well served today by either Hadoop or by the currently available tools and software for unstructured data analytics. Real-time is about the need for immediate detection and response – turning data sources into live data feeds, processing the data on the fly, and then loading batch-based distributed platforms such as Hadoop as an output data stream.

‘Stream Reasoning’

I’ve also seen the term ‘stream reasoning’ used to describe the real-time processing of unstructured data, although this is still an area that is less well developed and understood than the more mainstream text analytics from stored data. ‘Stream reasoning’ is the ability to process and respond to semantic knowledge about tweets, messages and other social media interaction in real-time, on the fly. The diagram below illustrates how a semantic modeling library has been plugged into a real-time streaming pipeline in SQLstream – the example is based on SQLstream’s GATE UDX, but any library with reasonable performance and a query response API can be plugged in.

Combining streaming structured and unstructured live data feeds

Unstructured data feeds, such as text messages and tweets, are streamed through the semantic tagging UDX and library, with the output of this stage being real-time streams of semantic tagged data. The data can then be analyzed and frequency charted in real-time.
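
The shape of that pipeline can be sketched in Python. To be clear about the assumptions: this is not the GATE UDX or any SQLstream API. The keyword lookup in `KEYWORD_TAGS` is a deliberately naive stand-in for a real semantic tagging library, and the function names and sample messages are hypothetical.

```python
from collections import Counter

# Naive stand-in for a semantic tagging library: keyword -> tag.
KEYWORD_TAGS = {
    "accident": "incident",
    "crash": "incident",
    "jam": "congestion",
    "queue": "congestion",
}


def tag_messages(messages):
    """Stage 1: turn a stream of raw text into a stream of (text, tags)."""
    for text in messages:
        words = text.lower().split()
        tags = sorted({KEYWORD_TAGS[w] for w in words if w in KEYWORD_TAGS})
        yield text, tags


def running_tag_frequencies(tagged):
    """Stage 2: emit an updated tag frequency snapshot after each message."""
    totals = Counter()
    for _text, tags in tagged:
        totals.update(tags)
        yield dict(totals)


feed = [
    "Crash on the A40 eastbound",
    "Huge jam after the accident",
    "Lovely sunny morning",
]

snapshots = list(running_tag_frequencies(tag_messages(feed)))
for snapshot in snapshots:
    print(snapshot)
```

Each stage is a generator, so a message flows through tagging and counting as soon as it arrives, and the latest snapshot is what a real-time frequency chart would plot.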

Text Analytics Summit, 2012, London

I’ll be speaking on this topic at the Text Analytics Summit, 2012, London. I’ll be discussing how to combine stream reasoning (admittedly, mostly Twitter messages) with structured data, with the objective of improving the overall accuracy and reliability of the resulting operational intelligence. I’ll be using a couple of examples – customer experience management for IP content services such as VoIP and VoD, and improving the accuracy and reliability of traffic congestion and travel time information. How can text analysis of tweets and messages help to pinpoint the severity of road network traffic problems?

Look forward to seeing you there, or if you can’t make it, I’ll be blogging on the highlights next week.

 


Vice President Marketing
April 2, 2012

Last week SQLstream sponsored and CEO Damian Black presented at Structure Data in New York, a conference exploring “the technical and business opportunities spurred by the growth of big data”.

It’s clear that Big Data has moved on considerably in a very short space of time, from the Silicon Valley, 101 world of Java developers and Hadoop into the mainstream, wider business world (but still with Hadoop!).

Some themes emerging from the conference:

  • The basic need to deliver high performance, massively scalable computing infrastructure as data volumes grow exponentially. It’s clear that the pain from structured and unstructured data is driving different approaches at different stages in the data management lifecycle – better visualizations, better cleansing and filtering, and a better understanding of the appropriate analytics tools that are most applicable at each stage.
  • The emergence of the SQL layer. It’s clear Hadoop has its strengths and is here to stay. It’s effectively ‘supercomputing lite’ and, given today’s data volumes, is just the tool for the job. However, there are a couple of trends emerging. First, is it actually necessary to store all the data, when much of it is obviously not of interest? Second, once the initial analysis of both structured and unstructured data is achieved, there’s an emerging layer above Hadoop that’s looking very structured. Both these functions are looking much more SQL-like.
  • Real-time, low latency analytics. Hadoop is not, nor does it claim to be, a low latency, real-time data management platform. There is a well-defined business need to analyze log file, sensor and network data in real-time (sub-second to a few minutes latency), but also to stream the arriving data through to Hadoop for further analysis. Obviously this layer needs to be as scalable as the underlying Hadoop platform, if not more so.

Damian’s presentation at Structure Data focused on relational streaming – massive-scale parallel data processing using SQL, generating real-time results from streaming input data. The talk described relational streaming as a standalone real-time management layer, and also SQLstream integrated with Hadoop as the streaming layer in the Big Data stack (you can also read the GigaOM report in the presentation here).

 

Posted under Big Data · Events · In the News