Cloud Analytics: Dataflow versus Databases
Realtime analytics drives a migration away from databases to more scalable parallel dataflow architectures.
Oct. 29, 2009 04:34 PM
For twenty years, analytics has been viewed as just one specific area within the broader relational database industry, so analytics has meant databases. Today that view is changing. Over the past year or so, a new movement, the "NoSQL" movement, has emerged, promoting the advantages of doing many kinds of analytics without using any relational database technology at all.
Whatever one thinks of the capabilities and limitations of distributed key-value stores relative to relational databases, one thing is clear: the stranglehold that SQL has held over all aspects of data analytics since 1990 is now coming to an end. Other non-SQL approaches to analytics, such as MapReduce/Hadoop, a very simple dataflow architecture for batch computing, are now gaining ground. As the need for realtime analytics grows, we will continue to see a migration away from databases and towards more scalable parallel dataflow architectures for analytics.

The main differences between databases and dataflow can be summarized as follows:
| Database      | Dataflow    |
|---------------|-------------|
| Historical    | Realtime    |
| Offline       | Online      |
| Pull Model    | Push Model  |
| High latency  | Low latency |
| Demand-driven | Data-driven |

The shift from databases to dataflow for enterprise cloud analytics mirrors what we have recently seen in another area, the "realtime web". The old demand-driven web model of polling, querying and pulling RSS feeds has proved unable to deliver the low latency required by the numerous new realtime web services being created by Twitter and others. New data-driven, realtime, push models such as PubSubHubbub and RSSCloud are now replacing the old approaches.
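The contrast between the pull and push models can be made concrete with a small sketch. The Python below is illustrative only: the class names `PullStore` and `PushFlow`, the `(key, value)` event shape, and the running-total computation are all assumptions chosen for the example, not part of any particular database or dataflow product. It shows the same totals computed two ways, once by rescanning stored data when a query arrives, and once by updating the result as each event arrives and pushing it to subscribers.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, int]  # (key, value): a hypothetical event shape


class PullStore:
    """Database-style: events are stored; results are computed only when queried."""

    def __init__(self) -> None:
        self._rows: List[Event] = []

    def insert(self, event: Event) -> None:
        self._rows.append(event)

    def query_totals(self) -> Dict[str, int]:
        # Demand-driven: every query rescans the stored history.
        totals: Dict[str, int] = defaultdict(int)
        for key, value in self._rows:
            totals[key] += value
        return dict(totals)


class PushFlow:
    """Dataflow-style: each arriving event updates the result and notifies subscribers."""

    def __init__(self) -> None:
        self._totals: Dict[str, int] = defaultdict(int)
        self._subscribers: List[Callable[[str, int], None]] = []

    def subscribe(self, callback: Callable[[str, int], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, event: Event) -> None:
        # Data-driven: the arrival of data triggers the computation.
        key, value = event
        self._totals[key] += value
        for callback in self._subscribers:
            callback(key, self._totals[key])


events: List[Event] = [("clicks", 3), ("views", 10), ("clicks", 2)]

# Pull: nothing is computed until someone asks.
store = PullStore()
for e in events:
    store.insert(e)
pull_result = store.query_totals()

# Push: subscribers see an updated total after every event.
updates: List[Tuple[str, int]] = []
flow = PushFlow()
flow.subscribe(lambda key, total: updates.append((key, total)))
for e in events:
    flow.publish(e)
```

In the pull version, the freshness of a result depends on how often someone queries, and each query pays the cost of a rescan. In the push version, the work is done incrementally on arrival, which is where the lower latency in the table above comes from.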
About Bill McColl
Bill McColl is Founder and CEO of Cloudscale Inc., which is developing a massively parallel cloud-based platform for continuous real-time intelligence on live data streams.
In 2006, he left Oxford University Computing Laboratory where for over twenty years he had been head of research in parallel computing and scalable systems. At the time of his departure, he was Professor of Computer Science and Chairman of the Faculty of Computer Science. McColl has published and lectured extensively on the design, analysis and implementation of massively parallel algorithms and systems.
He established and led Oxford Parallel, a major center for research on industrial and business applications of parallel computing at the university. He was also founder and CEO of Sychron Inc., a Silicon Valley VC-backed software company developing massively parallel system software for datacenter and desktop virtualization. Cloudscale Inc. is his second Silicon Valley company.