Expert Interview Series – Part 2: Hortonworks Co-Founder and Technical Fellow, Owen O’Malley on the Origins of Hadoop
You haven’t heard it until you’ve heard it from the horse’s mouth! Here is an interesting one-on-one with Hortonworks co-founder and technical fellow, Owen O’Malley, one of the first to begin coding Hadoop. He’s still very much in it, and he had a lot to share when he sat down with Syncsort’s Paige Roberts at the last Strata event. In Part 1, he discussed the early origins of Hadoop and the reasons why ORC files were invented. In Part 2, he shares some surprising news about the origins of Spark, and talks about Tez and the stunning performance you can get from Hive with ORC and the new LLAP technology.
What do you think about Spark?
Owen O’Malley: Spark is really amazing. It’s a very different paradigm than a lot of the traditional Hadoop stuff. I went to Berkeley and gave a talk to their machine learning group, so …
It’s your fault! [laughing]
It’s my fault, yeah. As a result of that talk, they ended up making Spark.
The only way to do [distributed machine learning] before then would’ve been to do it in MapReduce and then chain a bunch of MapReduce jobs together. Really, what Spark was made for was that kind of iterative algorithm where you need to load up the data, do some processing, and then send the results across to all the workers, and then repeat. In MapReduce, the only way to do that is to make a series of jobs that’s 50 or 100 long.
And really slow.
It’s going to be really slow because you have to do all the setup over and over again. It doesn’t work well. That’s really where Spark came from. So, especially for those iterative use cases, Spark does an amazing job. The thing Spark did really well is they have really nice developer APIs. If you’re programming against it, it’s got very, very nice APIs.
Which is how you end up with a nice ecosystem around it.
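To make the iterative pattern described above concrete, here is a minimal plain-Python sketch. It is not Spark or MapReduce code, and every name in it (`chained_jobs`, `resident_workers`, the tiny dataset) is an illustrative assumption. It runs the same simple gradient-descent loop two ways: once as a chain of separate jobs that each set up and reload the data, and once with the data loaded a single time and kept in memory across iterations.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    setups: int = 0  # how many times workers were started
    loads: int = 0   # how many times the dataset was read from storage


def load_partitions(counters):
    """Stand-in for reading the input from storage (hypothetical data)."""
    counters.loads += 1
    # Points on the line y = 3x, split into two "worker" partitions.
    return [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]


def partial_gradient(partition, w):
    """Work done by one worker: squared-error gradient for y ~ w * x."""
    return sum(2 * (w * x - y) * x for x, y in partition)


def step(partitions, w, lr):
    """One iteration: each worker uses the current w, results are combined."""
    n = sum(len(p) for p in partitions)
    grad = sum(partial_gradient(p, w) for p in partitions) / n
    return w - lr * grad


def chained_jobs(iterations, lr=0.05):
    """One separate job per iteration: set up and reload every time."""
    counters, w = Counters(), 0.0
    for _ in range(iterations):
        counters.setups += 1
        partitions = load_partitions(counters)
        w = step(partitions, w, lr)
    return w, counters


def resident_workers(iterations, lr=0.05):
    """Set up and load once, then iterate over data kept in memory."""
    counters, w = Counters(), 0.0
    counters.setups += 1
    partitions = load_partitions(counters)
    for _ in range(iterations):
        w = step(partitions, w, lr)
    return w, counters
```

Calling `chained_jobs(50)` and `resident_workers(50)` produces the same value of `w`, but the first pays for setup and loading 50 times while the second pays once. In a real cluster that repeated cost is job scheduling, JVM startup, and disk or network I/O rather than a counter increment.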
So, what about Tez?
Tez is very much the extension of MapReduce. It was our attempt to take the problems of MapReduce and the limitations of MapReduce and …
Yeah, fix them. When people say MapReduce is dead, well yeah, it was killed by Tez. Tez is an amazing thing.
Have you seen any of the LLAP demos? LLAP combined with Tez is really amazing. I was just seeing a demo today that combines LLAP and Tez. For a table with six billion rows, it’s coming back in under a second out of Hive.
You actually get very, very fast response. LLAP stands for “Live Long and Process.”
[Laughing] You jumped from Tolkien to Star Trek!
Yeah. I really need an orc doing the Vulcan hand sign just to tie the whole ORC and LLAP thing together.
Love that. [So, I created one.]
LLAP basically has the servers up and pre-warmed. Part of what we noticed is that when you spin up a new JVM, not only does the startup take a lot of time, it also starts before the HotSpot compiler has done any optimization. So, it’s not until after your process has been running a second that the HotSpot compiler has gone in and optimized the parts of your Java executable that are getting run a lot.
You have both effects. It takes a while to get going, and then you don’t run as quickly until HotSpot kicks in. LLAP fixes that because it leaves the servers up and shares them across users. It also caches in memory the files and columns, because it actually understands the columnar format that the data is in. Then the processing runs across the copy that’s in memory, if it has a copy. Otherwise, it will fetch it out of HDFS. So, you get these amazing speed-ups, especially when you combine Tez and LLAP together.
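The caching behavior described here, where a query uses the in-memory copy of a column if one exists and otherwise fetches it from storage, can be sketched as a read-through cache. The Python below is only an illustration of that idea and not LLAP's actual implementation. The `ColumnCache` class, the `fetch_from_storage` function, and the `FAKE_STORAGE` dictionary standing in for HDFS are all hypothetical.

```python
class ColumnCache:
    """Long-lived cache shared across queries, keyed by (file, column)."""

    def __init__(self, fetch_column):
        self._fetch_column = fetch_column  # called only on a cache miss
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, path, column):
        key = (path, column)
        if key in self._cache:
            self.hits += 1
        else:
            # No in-memory copy yet: fetch it from storage and keep it.
            self.misses += 1
            self._cache[key] = self._fetch_column(path, column)
        return self._cache[key]


# Hypothetical columnar "file" standing in for data stored in HDFS.
FAKE_STORAGE = {
    "sales.orc": {"price": [10, 20, 30], "region": ["east", "west", "east"]},
}


def fetch_from_storage(path, column):
    """Stand-in for a slow read of a single column from storage."""
    return list(FAKE_STORAGE[path][column])


def total_price(cache):
    """A 'query' that touches only the price column."""
    return sum(cache.read("sales.orc", "price"))
```

Running `total_price` twice against the same `ColumnCache` fetches the `price` column from storage only on the first call, and the `region` column is never loaded because no query asked for it. That per-column granularity is the benefit of a cache that understands the columnar layout. A real cache would also need eviction and invalidation, which this sketch leaves out.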
That’s amazing. What about streaming processing? I just saw the Capital One presentation, and they have all of these processors they’re dealing with. It’s getting crazy out there.
[Laughing] It’s a very dynamic field. That’s part of what’s exciting, though. It’s really exciting watching the whole ecosystem evolve so quickly. That’s really how you see whether an ecosystem is healthy or not. Watching how much activity is happening. Are new products coming in that can do different things?
It’s vibrant and alive.
What makes Hadoop such an exciting field to be in right now is watching all that activity.
With streaming, there are a lot of different contenders. I haven’t looked at them in a whole lot of detail, but I’m hearing really good things about Flink. We’ll see. Of course, with just general data in motion, Apache NiFi is a fascinating project.
It is. My friend Yolanda is working on it.
It’s headed in great directions.
Well, thanks for taking time to talk to me. I appreciate it.
See the results of Syncsort’s third annual Hadoop Survey in the free eBook: Hadoop Perspectives for 2017.