Archive Date
[ 17-02-2005 ]
Category
[ Information Technologies ]
Sub-Category
[ Computers ]
[http://www.eweek.com/article2/0,1759,1764741,00.asp
Database Legend: How Real-Time Data Analysis Will Transform Society
By Lisa Vaas
February 15, 2005
Mike Stonebraker is a database superstar. Not only is the former UC/Berkeley computer science professor the father of the popular relational databases Ingres and Postgres, he was also the founder of Illustra Information Technologies Inc., acquired by Informix, which in turn was acquired by IBM.
The next project for this database pioneer takes shape in the form of StreamBase Systems Inc., a company that's churning out software designed to process, analyze and act on real-time data "within milliseconds of its arrival." Stonebraker is StreamBase's founder and chief technology officer.
StreamBase announced its Stream Processing Engine at the DEMO conference on Monday in Scottsdale, Ariz. eWEEK.com Database Editor Lisa Vaas recently got a chance to talk with Stonebraker about the issue of real-time data analysis, about how it leaves relational databases in its dust and, most importantly, how this cutting-edge technology is poised to transform our society. Financial services comes to mind, of course, but what really fires up Stonebraker are prospects like revolutionizing the care of emergency-room patients, the care of soldiers on the front lines or simply the ability to find your child when she's lost at Disney World.
You've said that streaming data on the fly is something that ordinary relational databases can't handle. Why?
Here's a quick, simple little problem. This was a pilot we were asked to do early on. [It was] a large, mutual funds company. They subscribe to every feed on the planet, [including feeds such as Reuters]. They have a current application that watches each feed to determine if the data is late, so they can say, "Don't trust Reuters now, the feed is screwed up."
They defined "late" as [when the] inter-arrival time of ticks between the same stocks is greater than a certain number. You see an IBM tick, and if you don't see another IBM tick in x seconds, it's an indication of late data.
They wanted to issue an alarm if you saw a late tick. Then they wanted to say, "If you see 100 late ticks that are coming from the feed vendor, then ring the red telephone."
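The late-feed rule Stonebraker describes can be sketched in a few lines. This is an illustration only, not the fund company's C++ application and not StreamBase code. The class name, the alert labels and the choice never to reset the per-vendor count are all assumptions. The sketch also notices a gap only when the next tick for that stock arrives; a production system would presumably need a timer to catch a stock that goes silent altogether.

```python
from collections import defaultdict


class LateTickMonitor:
    """Flags late ticks per stock and escalates per feed vendor (hypothetical sketch)."""

    def __init__(self, max_gap_seconds, escalate_after=100):
        self.max_gap = max_gap_seconds        # the "x seconds" in the interview
        self.escalate_after = escalate_after  # 100 late ticks in the interview
        self.last_seen = {}                   # (vendor, symbol) -> last tick timestamp
        self.late_counts = defaultdict(int)   # vendor -> late ticks seen so far

    def on_tick(self, vendor, symbol, timestamp):
        """Process one tick and return the list of alerts it raises."""
        alerts = []
        key = (vendor, symbol)
        previous = self.last_seen.get(key)
        self.last_seen[key] = timestamp
        # A tick is late when the gap since the previous tick for the
        # same stock on the same feed exceeds the threshold.
        if previous is not None and timestamp - previous > self.max_gap:
            alerts.append(("late_tick", vendor, symbol))
            self.late_counts[vendor] += 1
            # Escalate exactly once, when the vendor's count reaches the limit.
            if self.late_counts[vendor] == self.escalate_after:
                alerts.append(("red_telephone", vendor))
        return alerts
```

Keeping the last timestamp per stock and a single counter per vendor means each tick is handled in constant time and nothing is written to storage.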
The current application is written on top of bare metal in C++. They were unhappy with the performance of the current application, and it was hard to maintain. And expensive.
On this application, they said, "How fast can you go?" We processed about 150,000 messages per second on this, on a $1,500 PC, a commodity piece of hardware. Their current production application does about 3,000 messages per second. The best we could get out of one of the very popular relational databases was 900 messages per second.
Elephants store data
In round numbers, we're two orders of magnitude faster than the elephants. And the two orders of magnitude are on identical hardware. If you normalize for clock speed of our production application vs. theirs, we're one order of magnitude faster.
What accounts for this speed gain?
There are three big reasons: One, the elephants store the data. There's no need to store the data. One of the characteristics of real-time, streaming data, it's like IT sushi. It has high value right now, and the value decays very quickly. There's no need to keep the data around for the long term in some sort of repository. That just takes up time, latency and resources to do that.
Reason No. 2 is when you're looking for the inter-arrival time between ticks, that's a time-series notion. When you're doing real-time stream processing, we have time-oriented primitives in the bottom of the screen. … We have extended SQL to something we call StreamSQL, which has extra stuff in it. … We've had to add another notion to SQL, the notion of time windows. You can do SQL-like calculations over time windows. Do them in real time as data is flying by. …
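The idea of a time window can be illustrated without any StreamSQL syntax, which the interview does not show. The Python sketch below keeps only the values that arrived in the last few seconds and maintains a running aggregate as each new value flies by. The class and method names are hypothetical, and the sketch assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingTimeWindow:
    """Aggregates over values seen in the last `width_seconds` (hypothetical sketch)."""

    def __init__(self, width_seconds):
        self.width = width_seconds
        self.items = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0      # running sum of the values currently in the window

    def add(self, timestamp, value):
        """Admit a new value, then evict everything that has aged out."""
        self.items.append((timestamp, value))
        self.total += value
        cutoff = timestamp - self.width
        while self.items and self.items[0][0] <= cutoff:
            _, old_value = self.items.popleft()
            self.total -= old_value

    def count(self):
        return len(self.items)

    def average(self):
        return self.total / len(self.items) if self.items else None
```

Because old values are evicted as new ones arrive, memory use is bounded by the window width and nothing has to be stored long term, which matches the "IT sushi" point above.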
[Finally,] if you want to count to 100, which is what this [application] had to do in order to decide to ring the red phone, the most efficient way to do that is with four lines of C++. In this application, it makes sense to mix small amounts of code in a general-purpose environment with database-oriented processing steps. We can do that in our architecture: freely intermix C++ with our StreamSQL primitives. The relational guys all run client/server, and C++ code has to run in the client in a separate place from the server. So the client/server architecture slows you down on this style of application.
What types of enterprises need this type of fast analysis?
Financial services, industrial process control, monitoring oil refineries, the government: Military and homeland security is full of this style of application. We've been talking to one of the three-letter agencies. The guys who won't give you their business cards. They're monitoring Arabic chatter. When the czar of homeland security says, "The chatter has changed," there's a real-time system processing incoming feeds, computing statistics on incoming Arabic language streams, to actually determine that. They started yakking with us on piloting that application.
Another example: network monitoring, for DOS [denial of service] attacks. Fraud detection.
Financial firms seek to thwart identity theft.
Another very large financial services company is exploring piloting another application with us. They're terrified the really bad guys, who do credit card fraud and identity theft, will target financial services. This company wants to monitor their worldwide network and watch application-level events. For example, they want to watch every log-in to their systems and watch for suspicious events such as the same user logged in more than once from two IP addresses more than a mile apart.
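A rule like the log-in check described here might be sketched as follows. Everything in it is an assumption made for illustration: the interview does not say how the company maps an IP address to a location, so `locate_ip` is a hypothetical lookup supplied by the caller, and distance is approximated with the haversine great-circle formula.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def miles_between(a, b):
    """Great-circle distance in miles between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


class LoginWatcher:
    """Flags a user with concurrent sessions from distant IPs (hypothetical sketch)."""

    def __init__(self, locate_ip, max_miles=1.0):
        self.locate_ip = locate_ip  # hypothetical lookup: IP string -> (lat, lon)
        self.max_miles = max_miles
        self.active = {}            # user -> set of IPs with open sessions

    def on_login(self, user, ip):
        """Record the log-in and return the user's open-session IPs that are too far away."""
        here = self.locate_ip(ip)
        sessions = self.active.setdefault(user, set())
        suspicious = sorted(
            other for other in sessions
            if other != ip
            and miles_between(here, self.locate_ip(other)) > self.max_miles
        )
        sessions.add(ip)
        return suspicious

    def on_logout(self, user, ip):
        """Forget a closed session so it no longer counts as concurrent."""
        self.active.get(user, set()).discard(ip)
```

The watcher keeps only the set of open sessions per user, so each log-in event is checked against a handful of addresses instead of a stored history.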
RFID [radio frequency ID] must pose big opportunities for this type of real-time data analysis, right?
What's coming is a microsensor revolution. The cost of microsensors is being driven down at a vast rate. … One of my favorite applications: I have kids, I've taken them to Disneyland and Disney World. It's a stressful situation. It's a crowded place, and you don't want to lose your kids, and it's awfully easy to lose them. The paper wristband you wear will turn into an electronic tag, and that will allow parents to dock at a kiosk so you can say, "Exactly where are my kids, so I can go get them?"
Another example: Mass General Hospital in Boston is very interested in getting hospital personnel to wear electronic tags. If there's a code blue, now, they issue a global alarm, and everybody lines up at the door of the person who has the emergency. If they knew where everybody was, they could dispatch the right person more efficiently.
The military is very interested in tagging all soldiers and all vehicles [so they can] monitor medical vital signs in real time.
There will be incredible social good from medical monitoring that will be possible from wireless technology downstream of cheap microprocessing technology.
The current database vendors are all selling one-size-fits-all, with a single engine being good for everything. I think at least in streaming data it isn't true, since there's just a huge performance problem with the one-size-fits-all model. … The one-size-fits-all paradigm is getting stretched. It will be interesting to see how it unfolds in the next few years.
Copyright © 1996-2005 Ziff Davis Publishing Holdings Inc. All Rights Reserved. eWEEK and Spencer F. Katt are trademarks of Ziff Davis Publishing Holdings, Inc. Reproduction in whole or in part in any form or medium without express written permission of Ziff Davis Media Inc. is prohibited.]