Hadoop on Amazon Web Services: How it Works

For many enterprises, the benefits of the Apache Hadoop platform are clear. When you’ve got terabytes of data and you need to process it quickly and inexpensively, the Hadoop platform offers clear advantages over traditional big-iron platforms. It’s scalable, it’s cheap, it’s fault-tolerant, and it runs on easily replaced commodity hardware. Hadoop makes good sense.

Suppose you were responsible for designing a real-time global weather tracking system. Readings would arrive around the clock from thousands of monitoring stations, doubtless in a babel of incompatible data formats. Hadoop, with its support for parallelizing data processing tasks across multiple servers, would be a natural platform for combining this data into a coherent weather modeling system. You’d buy some inexpensive hardware, install Hadoop, and get to work.
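To make the idea of parallelizing that work concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps that Hadoop would spread across servers. The station names, the readings, and every function name are hypothetical, chosen only for illustration; a real Hadoop job would run the map and reduce functions on many machines at once.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical readings: (station_id, temperature in degrees Celsius).
READINGS = [
    ("station-a", 11.0),
    ("station-b", 25.5),
    ("station-a", 13.0),
    ("station-b", 26.5),
    ("station-c", -3.0),
]


def map_reading(reading):
    """Map step: turn one raw reading into a (key, value) pair."""
    station_id, temp_c = reading
    return station_id, temp_c


def shuffle(pairs):
    """Shuffle step: group every value that shares the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_station(station_id, temps):
    """Reduce step: collapse one station's values into a single average."""
    return station_id, mean(temps)


def average_by_station(readings):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = map(map_reading, readings)
    groups = shuffle(pairs)
    return dict(reduce_station(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(average_by_station(READINGS))
```

Because each map call looks at one reading and each reduce call looks at one station, neither step depends on the others, which is what lets a cluster divide the work.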

But suppose your business needed Hadoop-level processing power only occasionally. Or what if your business’s growth path was uncertain, and you couldn’t know whether you would need a few or a lot of servers in your cluster a few months down the line?

The Elastic MapReduce solution

That’s where Amazon’s Elastic MapReduce service comes in. You can think of EMR, which is a key component of Amazon Web Services, as an implementation of Hadoop. (Actually, both EMR and Hadoop are implementations of ideas conceived in Google’s research labs and outlined in a paper presented at the 2004 Symposium on Operating Systems Design and Implementation in San Francisco.) EMR runs in the cloud, on Amazon virtual servers, and customers are billed by the minute. Because the cluster is virtual, it can be scaled up and down as necessary. Customers needn’t purchase server hardware they need only a few times per year.

The New York Times turned to Amazon EMR when it decided to make 70 years of back issues free online. The database consisted of 11 million articles, each of which was composed of multiple TIFF images. When the back issues were available only to paid subscribers, the Times generated PDFs from the TIFF images on demand. But Times execs worried that when the database was opened up for free access, the servers wouldn’t be able to serve up PDFs in a timely way. They resolved to do a one-time conversion, generating PDFs for all 11 million articles from the 4 TB database of TIFF images.

Amazon’s EMR lets programmers take advantage of parallel processing on an arbitrarily large array of virtual servers.

The team behind the Times project didn’t have access to computer hardware that would perform this conversion in a reasonable time, so it turned to Amazon Web Services and EMR. Times programmers wrote software to handle the data conversion and ran it on AWS. Using 100 virtual computers provided via Amazon’s Elastic Compute Cloud (EC2), the Times converted all 11 million articles in less than 24 hours, generating 1.5 TB of PDFs. (In fact, the process ran twice in 24 hours, a Times blogger reports, as the first implementation of the code resulted in an error in the generated PDFs.)
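A quick back-of-the-envelope calculation shows what those figures imply. The sketch below uses only the numbers quoted above (11 million articles, 100 virtual computers, 24 hours) and treats 24 hours as an upper bound, since the job finished in less time than that. The function name is made up for this example.

```python
ARTICLES = 11_000_000  # articles converted, from the account above
MACHINES = 100         # EC2 virtual computers used
HOURS = 24             # upper bound; the run took less than this


def throughput(articles, machines, hours):
    """Return (articles per second overall, articles per second per machine)."""
    seconds = hours * 3600
    overall = articles / seconds
    return overall, overall / machines


if __name__ == "__main__":
    overall, per_machine = throughput(ARTICLES, MACHINES, HOURS)
    print(f"at least {overall:.0f} articles/s overall")
    print(f"at least {per_machine:.2f} articles/s per machine")
    print(f"at most {MACHINES * HOURS} machine-hours")
```

In other words, the cluster sustained at least roughly 127 articles per second, or a little over one article per second on each machine, for no more than 2,400 machine-hours of rented capacity.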

Running Hadoop in the cloud

Establishing a virtual cluster of Hadoop servers on EC2 is a relatively straightforward process. The necessary software is already installed as part of AWS’s default offerings. Accessing it is a simple matter of filling out a series of dialogs and running a few configuration scripts. Hadoop clusters can be generated on the fly with batch files, with automatic incremental billing. Or you can contract for a requisite number of EC2 instances and keep your Hadoop services running around the clock.
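For readers who would rather script cluster creation than click through dialogs, here is a rough sketch using boto3, the AWS SDK for Python, and its EMR run_job_flow call. This is one possible approach and not the procedure described in this post. The cluster name, instance type, release label, region, and the two default role names are all assumptions that you would replace with values from your own account.

```python
def build_cluster_request(name, instance_count, instance_type="m5.xlarge",
                          release_label="emr-6.15.0", keep_alive=False):
    """Build keyword arguments for the boto3 EMR run_job_flow call.

    keep_alive=False describes an on-the-fly cluster that shuts down when
    its work is done; keep_alive=True keeps it running around the clock.
    All default values here are assumptions for illustration.
    """
    if instance_count < 1:
        raise ValueError("a cluster needs at least one instance")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        # Assumed role names; your account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request, region="us-east-1"):
    """Submit the request to AWS. Needs boto3 and valid AWS credentials."""
    import boto3  # third-party SDK, imported only when a launch is requested
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    # Only builds and prints the request; nothing is launched or billed.
    print(build_cluster_request("nightly-etl", instance_count=4))
```

Keeping the request builder separate from the launch call means you can inspect or version-control the cluster definition without touching your AWS account.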

Amazon offers EMR, its Hadoop implementation, to EC2 customers on a pay-by-the-minute basis. It’s pre-installed and convenient, and that works for lots of customers. But it’s also possible to contract for raw EC2 instances and install a different implementation of Hadoop. Data scientists at Rangespan, a high-tech firm that provides services to online retailers, chose to bypass EMR and install Cloudera’s CDH Hadoop implementation instead. The annual subscription price for CDH allowed Rangespan to save money compared to the per-minute charges associated with Amazon’s EMR. And Rangespan engineers found that CDH includes a preferable set of programming utilities and languages.
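Whether a flat subscription beats metered charges depends on how heavily the cluster is used. The sketch below works out the break-even point. Every price in it is a made-up placeholder, not Cloudera's or Amazon's actual pricing, and it assumes the underlying EC2 cost is the same either way, so only the subscription and the metered service fee are compared.

```python
def break_even_minutes(annual_subscription, fee_per_instance_minute, instances):
    """Cluster minutes per year at which a flat subscription costs the same
    as a metered per-instance, per-minute service fee."""
    fee_per_cluster_minute = fee_per_instance_minute * instances
    return annual_subscription / fee_per_cluster_minute


if __name__ == "__main__":
    # Hypothetical numbers only: a 5,000 subscription versus a metered fee
    # of 0.005 per instance-minute on a 4-instance cluster.
    minutes = break_even_minutes(5000.0, 0.005, 4)
    print(f"break-even after about {minutes / 60:.0f} cluster-hours per year")
```

Below the break-even usage, metered billing is cheaper; above it, the subscription wins. A cluster that runs around the clock is the kind of workload where a flat fee is most likely to pay off.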

Once you’ve decided on Amazon EC2 virtual servers to handle your big data workloads, the next job is to build your application. Getting productive with Hadoop can entail learning new programming languages, mastering new algorithmic partitioning models, and tracking Hadoop libraries and resources needed by your application. Bringing your technical staff up to speed can be expensive and time-consuming, potentially leading to costly delays.

Ease of design and development with Syncsort Ironcluster

That’s where Syncsort’s Ironcluster comes in.

Hadoop’s parallel-processing model is well suited to a common computing task known as ETL, in which large data sets are extracted from source systems, transformed into a consistent format, and loaded into a target system for analysis. Many of the greatest operational and financial benefits of Hadoop are associated with ETL applications.
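If ETL is new to you, the three stages are easy to see in miniature. The following sketch runs on a single machine with Python's standard library and has nothing to do with Hadoop or Ironcluster internals. The sample data, table name, and function names are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data, including one malformed row.
RAW_CSV = """station,temp_f
station-a,50.0
station-b,not-a-number
station-c,212.0
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop unparseable rows and convert Fahrenheit to Celsius."""
    clean = []
    for row in rows:
        try:
            temp_f = float(row["temp_f"])
        except ValueError:
            continue  # skip rows whose temperature is not a number
        clean.append((row["station"], (temp_f - 32.0) * 5.0 / 9.0))
    return clean


def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (station TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()


def run_etl(text, conn):
    """Run all three stages, then read back what was loaded."""
    load(transform(extract(text)), conn)
    query = "SELECT station, temp_c FROM readings ORDER BY station"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(run_etl(RAW_CSV, sqlite3.connect(":memory:")))
```

The transform stage handles each row independently, which is why ETL workloads split so naturally across the nodes of a Hadoop cluster.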

When time is money, high-level toolkits like Ironcluster are worth their weight in Bitcoins.

Ironcluster is a configurable ETL solution for Hadoop systems, including EMR on EC2. Ironcluster eliminates much of the need for manual coding and allows developers to get results fast, making development on Hadoop much simpler. With Ironcluster, software developers can design MapReduce jobs using a graphical interface, without writing a single line of code. Best of all, it’s scalable. Start with a small implementation and add more EC2 instances as needed.

Hadoop is an effective technical solution for managing huge stores of structured and unstructured data. Amazon’s cloud-based Hadoop platform, EMR, is a scalable, affordable framework for supporting Hadoop solutions. And Syncsort’s Ironcluster helps to make the construction of those solutions fast, inexpensive, and reliable. It’s no wonder more and more enterprises are turning to AWS as their cloud-based parallel processing platform of choice.

