Spark: Catching Fire With or Without Hadoop
Fire and water seem to dominate discourse about Big Data.
Large repositories of unstructured data, often built around the Hadoop file system, are called “data lakes.” Then there is the open source project begun at UC Berkeley in 2009: Apache Spark. The Spark community has grown steadily since then and is now one of the largest open source communities in Big Data, boasting 750 contributors from more than 200 organizations.
Comparisons between Spark and Hadoop can quickly get very technical, and there is friendly competition between Apache Storm and Apache Spark. One common theme is that Spark advocates claim it takes less effort to get from raw data to analytics, especially for machine learning applications. Those are helped along by Spark’s integration of MLlib, a machine learning library that plugs into Spark’s APIs.
Spark doesn’t leave Hadoop behind. It can ingest any Hadoop data source, and it can be readily worked into existing Hadoop pipelines. Spark does not require a separate data center effort; it can run on existing Hadoop clusters.
One leading integrator refers to the Spark platform as more “unified,” blending “ETL, interactive queries, machine learning and streaming analytics.” IBM has announced a major commitment to the platform.
Analytics with Ubiquity
Spark’s ubiquity is one of its strengths. Sure, some industry buzz has it that it can “spark” a Hadoop installation into hyperdrive, but Apache Spark need not be partnered with Hadoop. It can also run standalone, in cloud settings, or on Mesos. (Mesos is especially noteworthy because it powers the recently enhanced Apple Siri.)
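In practice, the choice among these deployment options mostly comes down to the master URL an application is given. The sketch below is a minimal illustration, not an official recipe: `master_url` and `build_session` are hypothetical helpers, the host names are invented, and the ports are only the usual defaults for each cluster manager. It also assumes a Spark version that provides `SparkSession` (2.0 or later).

```python
"""Sketch: one Spark application, several cluster managers."""

# Usual default ports for each cluster manager; real clusters may differ.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}
SCHEMES = {"standalone": "spark", "mesos": "mesos"}


def master_url(mode, host=None, port=None, cores="*"):
    """Return the Spark master URL for a deployment mode (hypothetical helper)."""
    if mode == "local":
        # Run inside a single JVM, using `cores` worker threads.
        return f"local[{cores}]"
    if mode == "yarn":
        # On a Hadoop cluster, YARN's location comes from the Hadoop config files.
        return "yarn"
    if mode in DEFAULT_PORTS:
        if host is None:
            raise ValueError(f"{mode} mode needs a host")
        return f"{SCHEMES[mode]}://{host}:{port or DEFAULT_PORTS[mode]}"
    raise ValueError(f"unknown deployment mode: {mode}")


def build_session(master, app_name="demo"):
    """Create a SparkSession against the given master (requires pyspark)."""
    # Imported lazily so that master_url() can be tried without Spark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(master).appName(app_name).getOrCreate()


if __name__ == "__main__":
    # Host names below are invented for illustration.
    targets = [
        ("local", None),
        ("yarn", None),
        ("standalone", "spark-master.example.com"),
        ("mesos", "mesos-master.example.com"),
    ]
    for mode, host in targets:
        print(mode, "->", master_url(mode, host))
```

The application code itself stays the same in each case; only the string passed to `build_session` changes, which is a large part of why Spark can be moved between a laptop, a Hadoop cluster, and a standalone or Mesos deployment.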
As Big Data delivers even greater Velocity and Volume, the need for faster data filtering and aggregation could well lead to greater Spark adoption.
As Tendü Yoğurtçu, general manager of Syncsort’s Big Data business, explained:
We believe that Apache Spark will play a critical role in a wide variety of next-generation use cases, including streaming ETL and the Internet of Things. We will continue to contribute to Spark and related Big Data projects to enable a uniform user experience for batch and real-time workloads across all data sources.
Variety, a third property of Big Data after Velocity and Volume, also figures in the popularity of Spark. In addition to Syncsort’s open source mainframe hookup, Spark connectors are available for QlikView, Tableau, Pentaho, PanTera, Zoomdata and TIBCO Jaspersoft.
Using these tools, data from multiple sources can flow into Spark-enabled data lakes and business intelligence dashboards.
Like a Ton of Databricks
Several Spark creators founded Databricks in 2013. In June of this year, Databricks announced availability of its Amazon cloud-hosted data platform. While enterprises can still roll their own using standalone Spark, the Databricks cloud service allows data scientists to focus on analytics. Future plans include what the folks at Databricks call “R-language notebooks,” designed to foster greater use of R with Big Data. There is even SparkR, an R front end for Spark.
Databricks believes Spark can play an important role in improving Big Data visualization. The company’s Hossein Falaki cites pairings with open source visualization tools like D3, Matplotlib and ggplot as examples of how Spark can facilitate big data science by making it easier to sample and manipulate large datasets.
Cloudera’s Justin Kestelyn refers to Spark as exerting a “gravitational pull,” but a stronger metaphor may prove to be the creativity Spark nurtures in future application design.