As with most new technology undertakings, if you’re considering Spark (and you almost certainly are), you’re considering Spark in the cloud. Spark doesn’t belong in the cloud just because it’s trendy; there are some legitimate reasons it fits there.
For one, getting all that data into Spark is no trivial task. If you are a Fortune 500 company, or in financial services, healthcare, or retail, it usually means offloading mainframe data, which can be difficult and time-consuming unless you have the right tools. Stashing Spark and its related data stores in the cloud means it’s not on premises, mucking up your other daily operations. Spark in the cloud means no expensive additions to your onsite infrastructure, so you can use it as either an experimental testing ground or the home of a brand new initiative, without increasing the complexity of your existing IT operations.
Spark Isn’t Easy
Where there’s a Spark, there’s a fire. Don’t look now, but the cloud is burning.
Giving Spark its own little house in the cloud also means that it can be made available across the organization. The problem is, Spark is hard. Data scientists usually use Python and R to coax information out of it, but your average user won’t have those skills. In these cases, you’ll need to plan for some kind of user interface so they can interact with Spark without having to learn any difficult programming languages. The good news is, once you have made Spark readily accessible to the masses, almost every department can put it to good use. New and valuable uses for real-time analytics exist in every facet of the organization, from marketing to finance, and administration to production.
Spark Is Fast, but Getting a Spark Platform Together Isn’t
While Spark is very appealing for its speed, getting it into the cloud and wired up with connectivity is not a fast process. Though there are incredibly potent tools for getting your data into Hadoop and Spark, making it sing and dance is a different story. The how-to’s of getting this done depend very much on which cloud solution you choose. Some solutions are basically Spark-as-a-Service: you load your data in, populating a Hadoop data lake accessible in the cloud, and then you can start analyzing. This way, you don’t have to build and configure Spark clusters, which is resource intensive. The other up-and-coming option is to DIY your own Spark connectivity into an existing service, such as one of several enterprise-class cloud services available. A few of these providers are already offering their own specialized Spark services.
You Need the Right Tools
Spark is powerful and useful, but it is not easy to work with. Instead of undergoing a lengthy, complicated DIY project at each step (getting Spark into the cloud, getting the data into Spark, and enabling users to leverage the platform), simply take advantage of the tools available to you.
Of course, the more ready-to-use tools you’ve got at your disposal, the quicker and easier this entire process will be. For instance, there is a solution that allows you to make mainframe data available to Spark (as well as the rest of the Hadoop ecosystem) in its native format: DMX-h by Syncsort. Customers in finance, insurance, healthcare, and other industries are already using this product to create a data lake environment for Spark processing, and it works with your onsite data analytics operation or in a cloud environment.
You can learn about DMX-h and how it helps bring Big Iron to Big Data platforms, including Hadoop and Spark, in the video, “Big Iron, Meet Big Data: Liberating Mainframe Data with Hadoop and Spark.”