What kind of hardware do you need to run a Hadoop environment? How do you configure it? These questions can be answered by conducting a series of calculations, which, fortunately, are not so difficult.
Determine How Many Machines You Need
The number of machines you need to run Hadoop, and the specs for those machines, depend on:
• The volume of your data
• The retention policy of the data (how much you hang on to)
• The kinds of workloads you have (data science is CPU intensive, whereas generalized analytics is heavy on I/O)
• The storage mechanism for the data (whether you’re using compression, containers, etc.)
You Can Make Some General Assumptions & Perfect Your Hadoop Environment Later
To start out, it’s best to make some general assumptions about a few of the variables, or you’ll find that there are so many parameters that it becomes too hard to set up your Hadoop environment. As you learn and your Hadoop operations grow and mature, you can fine-tune your environment.
Data Capacity Planning
The number of machines you use for your Hadoop environment depends on how much data you need to store and analyze. Because each machine typically holds a fixed number of hard drives, the data volume determines how many machines (and therefore how many spinning disks) you need. Capacity planning involves determining:
• How many nodes are needed
• The capacity of each node in terms of CPU
• The capacity of each node in terms of memory
HDFS is generally configured with a replication factor of three, meaning that every block of data is stored on three different nodes. That means that you’ll need three times the actual storage capacity for your data. Plus, you need to reserve some capacity on each machine for the temporary data produced by intermediate computations. As a general rule of thumb, you’ll want to keep disks filled to no more than about 70 percent of their total capacity. If you compress your data, factor the compression ratio into the calculation as well, since it reduces the storage you need. After determining how many nodes you’ll need, calculate how many tasks each node can manage. Finally, determine how much memory you need.
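The storage arithmetic described above can be sketched in a few lines of Python. This is only a rough estimate under stated assumptions: the function name, the 12 disks of 4 TB per node, and the 2:1 compression ratio are hypothetical values chosen for illustration, not recommendations. Only the replication factor of three and the 70 percent fill target come from the rules of thumb above.

```python
import math


def nodes_needed(raw_data_tb, replication=3, compression_ratio=1.0,
                 max_disk_utilization=0.70, disks_per_node=12,
                 disk_size_tb=4.0):
    """Estimate how many data nodes are needed to store raw_data_tb.

    disks_per_node and disk_size_tb are illustrative assumptions;
    substitute the specs of your own machines.
    """
    # Compression shrinks the data before it is stored
    # (a ratio of 2.0 means 2:1; 1.0 means no compression).
    compressed_tb = raw_data_tb / compression_ratio
    # HDFS stores each block `replication` times.
    replicated_tb = compressed_tb * replication
    # Keep disks at roughly 70 percent full to leave room for temporary data.
    required_capacity_tb = replicated_tb / max_disk_utilization
    # Each node contributes a fixed number of disks.
    node_capacity_tb = disks_per_node * disk_size_tb
    return math.ceil(required_capacity_tb / node_capacity_tb)


# 500 TB of raw data with an assumed 2:1 compression ratio
print(nodes_needed(raw_data_tb=500, compression_ratio=2.0))  # 23
```

With these assumed numbers, 500 TB of raw data compresses to 250 TB, grows to 750 TB after replication, and needs about 1,071 TB of disk at a 70 percent fill level, which works out to 23 nodes of 48 TB each. If you have a retention policy, the raw volume to plug in is roughly your daily ingest multiplied by the number of days you keep the data.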
A Special Note About YARN
In earlier versions of Hadoop MapReduce, each node had a fixed number of map slots and reduce slots. When working in YARN, those fixed limits disappear. Just discard the idea of fixed slots: YARN allocates resources in terms of the memory and CPU available on each node and makes those resources available to both map and reduce tasks.
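As a rough illustration of planning in terms of memory and CPU rather than slots, the sketch below estimates how many tasks a node could run at once. The function name, the amounts reserved for the operating system and Hadoop daemons, and the per-task size (4 GB and one virtual core) are all hypothetical assumptions for the sake of the example; the usable amounts correspond to what you would give the NodeManager through `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores`.

```python
def tasks_per_node(node_memory_gb, node_vcores,
                   reserved_memory_gb=8, reserved_vcores=2,
                   task_memory_gb=4, task_vcores=1):
    """Estimate how many tasks fit on one node at the same time.

    The reserved amounts and per-task sizes are illustrative assumptions.
    """
    # Leave some memory and CPU for the operating system and Hadoop daemons.
    usable_memory_gb = node_memory_gb - reserved_memory_gb
    usable_vcores = node_vcores - reserved_vcores
    # How many tasks each resource could support on its own.
    by_memory = usable_memory_gb // task_memory_gb
    by_cpu = usable_vcores // task_vcores
    # The scarcer resource sets the limit; never report a negative count.
    return max(0, min(by_memory, by_cpu))


print(tasks_per_node(node_memory_gb=64, node_vcores=16))   # 14
print(tasks_per_node(node_memory_gb=128, node_vcores=16))  # 14 (CPU-bound)
```

In the second call, doubling the memory does not raise the task count because the node runs out of virtual cores first. That is the practical consequence of sizing by both resources: whichever one is scarcer determines how much work a node can take on.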
Another must-have for any Hadoop operation is an easy, no-fuss way to access and integrate your enterprise-wide data into Hadoop. That’s where data integration solutions like those from Syncsort come in. Syncsort’s multi-award-winning data integration software provides organizations with an easy way to gather, transform and distribute batch and streaming data coming from multiple enterprise data sources, including mainframe and Kafka, for advanced analytics in Hadoop and Spark.