You haven’t heard it until you’ve heard it from the horse’s mouth! We brought you technical vision from Hortonworks CTO, Scott Gnau. Now, here is an interesting one-on-one with Hortonworks co-founder and technical fellow, Owen O’Malley, one of the first to begin coding Hadoop. He’s still very much in it, and he had a lot to share when he sat down with Syncsort’s Paige Roberts at the last Hadoop Summit.
Paige Roberts: I have a question that I’ve been asking everybody, and it’s always fun to get the answer, especially from someone with your background. What is Hadoop for?
[Laughs] Hadoop is for processing a lot of data on a lot of machines, all at once.
I think that’s the most succinct, straightforward answer I’ve gotten from anyone.
O’Malley: Well, I’ve been answering that question for ten and a half years now, so, I’ve had a lot of experience!
[Laughing] I know about your contributions to ORC and some other things like Hive, but you’re pretty much Hadoop ground floor. So talk to me about where you started and how you got here.
Okay. I was working in Yahoo! Search, the team that did the back-end search for building the index. We had a large graph of the entire Web that they knew about, where every node was a URL, and all the links between them were modeled as edges in the graph. We used that to build the index. It took 800 computers a month to build each iteration of the index, and we wanted to speed that up. That project was called WebMap in a Week. Google had written the papers about MapReduce and the distributed file system. We were like, “Okay, we want one of those.” So we started a project to code-up our own implementation, and we wanted to open source it. It was in C++ and we worked on it for about six months.
Then we saw the code that eventually became Hadoop, and so we dropped the code that we were doing and picked up Hadoop. Hadoop, at that point, ran on ten machines, if you were lucky. We picked up Hadoop because it would be easier to get Yahoo! legal permission to work on an open source project, because it’s already open source rather than to open source a brand new thing. So, that old code went away. We adopted Hadoop, and I’ve been working on Hadoop ever since.
When was this?
This was 2006. Ten years ago. Actually, I made a sandbox of what the code looked like back then, and some people have been downloading and playing with it. There are pieces of it you recognize right away, but then there’s a bunch of stuff that is very different. It’s much simpler. Back then, Hadoop consisted of like 5,000 lines of Java code for HDFS and 7,000 for MapReduce. The whole thing was a relatively small project. It was basically a grad student project. Then Yahoo! worked on it, and the community picked it up since then to make it into the large project that it is today.
Talk to me about ORC.
ORC came about because we were doing the Stinger project to speed up Hive by two orders of magnitude. A lot of that work involved vectorization and pipelining the operators, because traditionally database execution engines were row by row. They would feed each row into the pipeline and then send it to the pipe point, but that involved a lot of virtual dispatch calls, a lot of if-then-else’s. All that stuff stalls the operator pipeline on the CPU every time you do it. We wanted to rewrite the engine to use vectorization. But, down at the bottom level, the storage formats didn’t have enough information, at that point, to help do the vectorization. Also, to allow you to do things like predicate push-down.
In predicate push-down, the execution engine says, “Okay, I only need to see records that look like this. So, for example, I only want to see records where people have more than a million dollars in their bank account.” Or, if the file is sorted on the customer ID, saying, “I just need to see IDs in this range or this particular ID.” Then the file can take that predicate and say, “Okay, I only need this one chunk of the file”
I would think having the data in a columnar format would help as well to only select the fields you want.
Columnar is good, but there were already columnar formats.
Yeah, so that wasn’t new.
Right, that wasn’t a new thing.
The improved use of vectorization, as well as the understanding of the type, because you need to understand what type it is in order to build an index. You don’t want your index for time stamps to get confused, because sometimes it looks like a string. You actually need to know what type it is, and the field columnar formats were just treating it as a bunch of bytes for each type. So, they couldn’t build indexes and they couldn’t do the predicate push-down that we needed. That’s really where a lot of it came from.
What was the timeline on that? When did you come out with that?
I think it was three and a half years ago now.
So, it’s relatively young still.
In the grand scheme of things, yeah, it’s relatively young.
Except in the Hadoop ecosystem where that’s practically ancient.
Syncsort’s been doing a lot with HCat to ORC. How did HCat support get going?
We did HCat to ORC relatively quickly after ORC came out. Actually, if you’re still using that, you should probably move over to the more native interface, because HCat throws its own translation layer in the middle. So, you’ll get more performance if you go to ORC directly.
We’ve been considering that. One of the advantages of the HCat layer is the metadata. It’s really important to our customers to track that metadata.
The other thing that’s interesting about ORC is now there’s the C++ reader for ORC, which is significantly faster than even the Java reader. That’s another option, because you guys are in C++.
Syncsort is written in C++, yes. How do you use the C++ reader?
It’s just another library. It’s pure C++ and doesn’t reference the Java at all. It was originally done between Hortonworks and Vertica, because Vertica wanted to be able to access ORC files in HDFS from their Vertica engine without incorporating any Java into their process engine. It’s really hard to control the virtual memory space that Java allocates.
That makes sense.
We’re finding that a lot of partners with C++ engines are pulling it in, so they can access it directly without dealing with Java.
What about YARN? If some of your apps are in C and some are in Java, how do you manage the resources?
That actually is okay. YARN will allocate you resources, and then as long as your application knows how to talk to YARN and request resources, it doesn’t actually care whether it’s C++ or Java.
It basically says, you can run stuff over there, and it’s up to you what you run.
In Part 2, Owen will share some surprising news about the origins of Spark, Tez, and the stunning performance you can get from Hive with ORC and the new LLAP technology.