Spark: Open Source Superstar Rewrites Future of Big Data

From:

http://www.wired.com/wiredenterprise/2013/06/yahoo-amazon-amplab-spark/all/

By Cade Metz
06.19.13
6:30 AM

Students in the AMPLab at the University of California at Berkeley, birthplace of Spark. Photo: Ariel Zambelich/Wired

Ram Sriharsha works in the engine room powering one of Silicon Valley’s most influential companies. He’s an engineer at Yahoo.

Even after naming ex-Google star Marissa Mayer chief exec, Yahoo often is derided as a thing of the past, a fallen giant struggling to keep pace with the likes of Google, Facebook, and Twitter. Behind the scenes, though, thanks to people like Sriharsha, Yahoo is in many respects a step ahead of its much flashier competition — and has been for years.

Yahoo’s Sunnyvale, California headquarters is ground zero for Hadoop, an open source software creation that underpins a Who’s Who of the internet, including Facebook and Twitter. After reinventing not only the web but the world of business software, the sweeping software platform, a means of crunching vast amounts of data across thousands of computer servers, is one of the great open source success stories of the past decade, and its influence is only expanding. But Yahoo, its founding father, is moving on.

Teaming with a particularly ambitious group of computer scientists from the University of California at Berkeley, Sriharsha is installing a new data crunching platform inside the massive data centers that drive Yahoo’s still enormous online empire. This software platform is called Spark, and according to those who built it and use it, it’s about 100 times faster than the mighty Hadoop — and could very well replace Hadoop as the stuff that fuels the modern web.

“The goal is to build a new generation of data analytics software, to be used across academia and industry,” says Berkeley professor Ion Stoica, part of the team behind Spark.

Little more than three years old, Spark is very much a fledgling technology. But as Yahoo takes the plunge, according to researchers at Berkeley, Amazon is kicking the tires on the platform. Chip maker Intel is helping expand and improve the project at a lab in China that typically feeds larger Chinese websites like Baidu and Tencent. And Facebook, another key force behind Hadoop, says it’s exploring the use of related software in the tools that help drive its everyday operations.

Part of the trick is that Spark can store data in the memory subsystems of the thousands of servers it pulls together. Hadoop stores its data on good old-fashioned hard disks, and grabbing data from memory requires far less time. But Spark also is what you might call a Swiss Army knife of Big Data analytics tools, says Reynold Xin, one of the Berkeley researchers who works on the project. Hadoop is often used in tandem with sister data analysis tools (tools that let you rapidly examine “real-time” data such as Tweets or ask questions of data via the familiar SQL query language), but Spark lets you do all this from a single piece of software.

“It works in a wide variety of ways,” Xin says, “and in some cases, it works better than systems optimized just for a specific task.”

The tool is still a long way from replacing Hadoop, and indeed, that may never happen. Twitter is using another software tool developed at Berkeley, a Google-mimicking contraption called Mesos, but has no plans to move from Hadoop to Spark. “The big uphill battle with things like Spark is that a lot of companies are pretty entrenched with existing tech,” says Twitter’s Ben Hindman, who helped build Mesos. “There is a huge Hadoop cluster here. I don’t even know how many machines.”

Yet Spark has a better chance than most. It, too, is open source software — and no less a name than Yahoo has already put its weight behind it.

Matei Zaharia (left) and Ion Stoica. Photo: Ariel Zambelich/Wired

The Superstar

The main brain behind Spark is Matei Zaharia, a Romanian-born graduate student who has spent the last few years at Berkeley’s AMPLab, a research operation dedicated to software that runs across tens of thousands of machines, aka “distributed software.” Working under another Romanian, Berkeley professor Ion Stoica, Zaharia was not only the main architect of the platform, but also the primary force behind the ongoing effort to push Spark onto the web and beyond.

In this way, he’s a bit like Doug Cutting, the man who famously founded the Hadoop project. But according to Xin, even this sells him short. “He’s a superstar — one of the smartest people I know and one of the hardest working,” Xin says. “I describe him as Ion Stoica and Doug Cutting in the same body. So, on the one hand you have this superstar researcher who has been publishing at top conferences and getting best paper awards, and on the other hand, you have this great open source guru that is building up an entire community.”

The project began as a way of expanding the scope of Mesos. Designed by Zaharia, Ben Hindman, Ali Ghodsi, and a fourth Berkeley researcher, Andy Konwinski, Mesos is a means of running multiple distributed software platforms atop the same cluster of servers. Traditionally, you run a distributed system on one server cluster, and then, if you wish to run another, you set up a second cluster. But Mesos lets you run multiple systems — say, Hadoop and a platform like Storm, which rapidly examines “real-time” data along the lines of Tweets and other internet posts — atop one uber cluster. Spark began simply because the team needed something they could run atop Mesos.

“After Mesos, Matei looked around and said: ‘What do I do next, as an academic and someone who’s passionate about open source software?’” Konwinski remembers. “He made a real aggressive play by building a far easier and faster engine for Hadoop.”

The idea was to rebuild Hadoop from scratch, and shifting data from hard disks to memory was a natural move. But Zaharia and team went further, eventually building additional data analysis tools atop the platform. Hadoop often is used in tandem with Storm and distributed engines such as Hive, which let you slice and dice data via the SQL query language. But Spark is designed to mimic these tools directly, offering myriad possibilities from the same piece of software. Tools called Shark (analogous to Hive) and Spark Streaming (analogous to Storm) already run atop the platform.

“We’re betting that this thing will be the next software stack that integrates all these popular frameworks into one framework to rule them all,” says Konwinski.

What’s more, Zaharia and team sought to hone the Hadoop programming model. With Hadoop, you build data-crunching programs using the venerable Java programming language. Spark also embraces Python and Scala, a newer language designed specifically for applications that operate across many machines, and it provides a set of pre-defined APIs, or application programming interfaces, for building new programs. “[These APIs make] it so much easier to program,” Xin says. “Building a program with these APIs, for many, many servers, looks remarkably similar to what you would do to build a program for a single machine.”
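To give a feel for the programming style Xin describes, here is a minimal single-machine sketch in Python. It is not Spark code: `LocalDataset` is a hypothetical stand-in written for this illustration, and its method names only loosely echo the `flatMap`, `map`, and `reduceByKey` operations that Spark exposes. The point is the shape of the program, a chain of small transformations over a collection, which would look much the same whether the collection sat on one laptop or were spread across many servers.

```python
from collections import defaultdict
from functools import reduce


class LocalDataset:
    """Hypothetical single-machine stand-in for a distributed collection."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        # Apply fn to every item and flatten the resulting iterables.
        return LocalDataset(out for item in self.items for out in fn(item))

    def map(self, fn):
        # Apply fn to every item, one output per input.
        return LocalDataset(fn(item) for item in self.items)

    def reduce_by_key(self, fn):
        # Group (key, value) pairs by key, then fold each group's values with fn.
        groups = defaultdict(list)
        for key, value in self.items:
            groups[key].append(value)
        return LocalDataset(
            (key, reduce(fn, values)) for key, values in groups.items()
        )

    def collect(self):
        # Return the contents as an ordinary Python list.
        return list(self.items)


def word_count(lines):
    """Count word occurrences with a chain of collection transformations."""
    return dict(
        LocalDataset(lines)
        .flat_map(lambda line: line.lower().split())
        .map(lambda word: (word, 1))
        .reduce_by_key(lambda a, b: a + b)
        .collect()
    )


# Made-up input lines, purely for illustration.
counts = word_count(["spark is fast", "hadoop is mighty", "spark is open source"])
print(counts)
```

In a real cluster framework, each step in the chain would be scheduled across machines behind the scenes; the calling code would not need to change much to reflect that.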

Other tools share certain characteristics with Spark. Creations like Hana, from tech giant SAP, have moved data analysis tasks into memory. And tools such as Cloudera’s Impala and EMC’s Pivotal HD seek to improve the speed of SQL queries atop Hadoop. But no one provides that Swiss-Army-knife quality that Reynold Xin speaks of.

“Spark is not just an in-memory system,” says Zaharia. “It provides so much more. As researchers, we wanted to think ahead — to think about all sorts of things people will need years from now.”

Machine Learning Reborn

But that doesn’t guarantee success. In order to succeed, technology must be more than just effective. It must also have software developers — and big-name companies — behind the project. “You need people like Matei who have a passion for creating open source and are willing to man email lists and spend a lot of their lives getting people to use their software,” Konwinski says.

Spark hardly has the support that Hadoop enjoys (no fewer than three companies sell their own versions of Hadoop and related software and services), but the AMPLab is at least on the way.

One new company, known as ClearStory Data, seems to be building some sort of commercial software platform that uses Spark. And the Spark open source project is on the verge of following Hadoop as an official project at the Apache Software Foundation, which gives added weight to efforts to fashion a truly open software platform. But the biggest development may be Spark’s push into Yahoo.

Yahoo is a web portal — a place where you visit web applications and sites — but also, like Google, an advertising company, and a platform like Spark is particularly suited to the advertising game. According to Yahoo’s Ram Sriharsha, the platform will provide a quicker means of determining which ads it should show to which visitors. “We’re in the process of putting it into production,” he says. “It will inform our data centers on how to get the best return on investment for our advertisers.”

Xin, who also is part of the Yahoo team that’s deploying Spark, says the company is particularly attracted to Spark because it’s suited to machine learning algorithms, which alter the way a computing system behaves based on the way it has behaved in the past. Many machine learning algorithms involve crunching and re-crunching the same data, over and over again. A technique called “logistic regression” is a common example. With Hadoop, this can be particularly time-consuming because you have to visit the hard disk with each iteration of the algorithm. But with Spark, you can iterate in memory.

“Hadoop does a pretty terrible job with machine learning,” Xin says. “Spark is good with logistic regression, and that can help with anything that involves a binary decision: Is this message spam? Should I show this ad to this user?” Then, of course, the company can use the platform to rapidly analyze the vast amounts of data generated by services across the Yahoo empire.
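The iterative pattern described above can be sketched in a few lines of plain Python. This is not Yahoo's code or Spark's implementation, just a minimal logistic regression trained by batch gradient descent on a toy, made-up data set; the function names, learning rate, and iteration count are all assumptions chosen for illustration. What matters is the loop: every iteration rescans the same data, which is cheap when the data stays in memory and expensive when each pass means another trip to disk.

```python
import math


def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(points, labels, iterations=200, learning_rate=0.5):
    """Fit weights by batch gradient descent; each iteration rescans all points."""
    n_features = len(points[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        # One full pass over the same in-memory data set.
        for x, y in zip(points, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Step against the average gradient.
        weights = [
            w - learning_rate * g / len(points) for w, g in zip(weights, grad_w)
        ]
        bias -= learning_rate * grad_b / len(points)
    return weights, bias


def predict(weights, bias, x):
    """Binary decision: 1 if the estimated probability is at least 0.5, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy, made-up data: a single feature, with label 1 when the feature is positive.
points = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(points, labels)
predictions = [predict(weights, bias, x) for x in points]
print(predictions)
```

With 200 iterations the loop reads the six points 200 times over. Scale that to billions of records and the cost of where the data lives, memory or disk, dominates the running time.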

Some will say that Google is still well ahead of both Yahoo and Spark. The search giant has built its own tools for quickly analyzing enormous amounts of data, most notably a creation called Dremel, but, as with Hadoop, Yahoo is taking a path that will end up benefiting more than just itself. Unlike Dremel, Spark is open source. Anyone can use it.

Spark may or may not be the future of Big Data. But the future is certainly open source.