Open source software is eating the world of Big Data and analytics – so much so that it can be hard to keep track of all the open source data tools out there. Here’s a guide to the top open source data products that will lead the market in 2017.
When I say open source, of course, I mean tools or platforms whose source code is publicly available. Most open source data products are governed by the Apache License, which sets them apart from many other big-name open source projects, like Linux and GNU, that use the GPL. Most are also available in raw form at no cost, although enterprises seeking to deploy these products in production will probably find the greatest value in commercial distributions that make them easier to install and manage.
Open source software has been central to the Big Data world for years, ever since platforms like Hadoop and programming languages like R began offering data analysts easily obtainable and extensible tools for working with large volumes of information.
But the open source data world has expanded steadily since then. Today, a variety of other open source products are dominating the various niches within the data analytics and storage market. Let’s take a look at them.
Hadoop continues to loom large as a platform for storing large amounts of data on commodity hardware. But Hadoop is only one of many open source choices for handling this type of workload.
Also worth noting are scale-out storage systems like Ceph and Gluster. These distributed file systems don’t overlap completely with Hadoop, but they do provide the key functionality delivered by Hadoop’s file system, HDFS.
You can’t mention open source data storage without also taking note of the various open source databases that predominate for storing data and serving it to applications, in both traditional environments and massive scale-out infrastructure. Beyond MySQL, the tried-and-true open source database that companies have known and loved for a generation, there are also newer, more flexible “NoSQL”-style databases like MongoDB, Cassandra and Redis. These newer databases can be useful for handling data whose form and size are more diverse and unpredictable than the information handled by traditional databases.
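To make the contrast between traditional and NoSQL-style storage concrete, here is a minimal sketch in plain Python. It uses the standard library's `sqlite3` module as a stand-in for a relational database and a list of dictionaries as a stand-in for a document store. The table, the field names and the helper functions are all hypothetical, and nothing here uses the actual MongoDB, Cassandra or Redis APIs.

```python
import sqlite3

# Relational style: every row must fit the columns declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute(
    "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
    (1, "Ada", "ada@example.com"),
)


def insert_extra_column():
    """Try to store a field the table was not designed for."""
    try:
        conn.execute(
            "INSERT INTO users (id, name, tags) VALUES (?, ?, ?)",
            (2, "Lin", "admin"),
        )
        return True
    except sqlite3.OperationalError:
        # The schema rejects the unknown "tags" column.
        return False


# Document style: each record carries its own shape, so new or nested
# fields can appear without a schema change.
documents = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Lin", "tags": ["admin", "beta"], "devices": {"phone": "android"}},
]


def fields_seen(docs):
    """Return every top-level field name that appears in any document."""
    return sorted({key for doc in docs for key in doc})
```

In this sketch the relational insert fails until the table is altered, while the document list simply accepts the differently shaped record. The trade-off is that the application, rather than the database, has to cope with fields that may or may not be present.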
Storing data is only half the battle. If you want to derive value from Big Data, you also need a way to analyze it effectively and quickly, which is no mean feat when the information you’re dealing with has a massive volume, is stored on a multitude of different systems and needs to be interpreted in real time.
For this challenge, too, an array of open source data products is available. Some Hadoop components support analytics, but if you want to make data actionable in real time, Apache Spark is usually a better option: it speeds analytics by processing data in system memory (in other words, RAM) instead of on slower disk storage. Spark is probably the best-known real-time data analytics platform, but it is not the open source ecosystem’s only answer for real-time analytics. Apache Storm and Apache Kafka provide related stream-processing functionality, and can be better fits depending on the use case.
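The idea behind in-memory processing can be illustrated with a small, plain-Python sketch. This is not Spark's API. It only shows the general principle of reading a dataset from disk once, keeping it in RAM, and then answering several queries without going back to storage. The file name, the `CachedDataset` class and the sample data are hypothetical.

```python
import csv
import os
import tempfile


def write_sample(path, n=1000):
    """Write a small CSV of fake sensor readings to disk."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sensor", "reading"])
        for i in range(n):
            writer.writerow([f"s{i % 10}", i])


def read_rows(path):
    """Read the whole CSV from disk into a list of (sensor, reading) pairs."""
    with open(path, newline="") as f:
        return [(row["sensor"], int(row["reading"])) for row in csv.DictReader(f)]


class CachedDataset:
    """Loads data from disk on first use, then serves every query from memory."""

    def __init__(self, path):
        self.path = path
        self._rows = None
        self.disk_reads = 0

    def rows(self):
        if self._rows is None:  # only the first query touches the disk
            self._rows = read_rows(self.path)
            self.disk_reads += 1
        return self._rows

    def total(self):
        return sum(reading for _, reading in self.rows())

    def maximum(self):
        return max(reading for _, reading in self.rows())


def demo():
    """Run two queries and report how many times the disk was read."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "readings.csv")
        write_sample(path)
        dataset = CachedDataset(path)
        return dataset.total(), dataset.maximum(), dataset.disk_reads
```

Both queries in `demo()` are answered after a single read from disk. The more queries that run over the same data, the more the cost of that one read is spread out, which is the effect in-memory engines aim for at much larger scale.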
Data Collectors and Connectors
So far, we’ve discussed solutions for storing and analyzing Big Data. The third piece of the data puzzle, which is equally important even though it tends to receive less attention, is the set of tools that help you collect data and transfer it between different types of storage and analytics platforms. These data collectors and connectors are essential for ensuring that organizations can move data efficiently between the numerous platforms available for working with it.
On this front, open source tools like Apache Flume and Apache Sqoop are names worth mentioning. They help aggregate information from diverse sources and feed it into a platform like Hadoop.
To learn more about what is trending in Hadoop, check out Hadoop Perspectives for 2017. This free eBook summarizes the results of Syncsort’s third annual Hadoop survey, uncovering the trends in Big Data to watch for in 2017!
Equally significant is functionality that is built into the open source code base of data analytics and storage platforms themselves to facilitate easy data collection and connectivity. For example, Syncsort has contributed code to Spark to simplify the process of loading mainframe data into the platform. Syncsort made similar open source contributions to Sqoop in order to make Hadoop more friendly toward mainframe data.
As we go into 2017, it’s clear that the open source Big Data world is as dynamic and diverse as ever. There are now so many products to choose from that users have no reason to lock themselves into a certain type of data storage or analytics solution. This is one big reason why the data interoperability provided by vendors like Syncsort, whose data solutions allow companies to move data from any type of infrastructure to the analytics platform of their choosing, is so important for the future of open source Big Data.