Data Integration

What an exciting couple of weeks for Hadoop and the future of data management platforms. Cloudera and Hortonworks announced the closing of new investment rounds worth hundreds of millions of dollars, followed by several product announcements moving Hadoop closer to being an enterprise-ready platform.

Following Hadoop World, I shared my perspective on how YARN-based Apache Hadoop 2 is defining the future of data management platforms, enabling a variety of workloads and use cases to run in Hadoop. Cloudera’s Beta announcement at the time and introduction of Enterprise Data Hub reflected their vision to take on more enterprise workloads and position Cloudera in the center of enterprise data management.

Cloudera 5

With the general availability of Cloudera Enterprise 5, Cloudera is now executing full force on tackling all data workloads. This release, powered by Hadoop, challenges the data warehouse space with an enterprise-strength data management platform, offering a full stack of applications and tools for big data analytics. Cloudera Enterprise 5 lays the foundation for the Enterprise Data Hub: one place to store and process all data. The release adds many new components, including Apache Spark, Parquet and the Kite SDK, and synchronizes delivery of existing ones such as Cloudera Manager, simplifying deployment, provisioning and monitoring while supporting new use cases and a variety of workloads.

This release also simplifies deployment of third-party software from strategic certified partners like Syncsort, Revolution Analytics and SAS. These offerings can now get the same data-locality benefits as services such as Impala and Search. Syncsort’s Big Iron to Big Data solution runs natively in Cloudera Enterprise 5 and is deployed through Cloudera Manager Parcel integration, offering a simple and unified deployment experience for the end user. As Mike Olson recently tweeted, “IBM offers Hadoop on the mainframe. Syncsort offers the mainframe on Hadoop.”

Cloudera’s positioning of Hadoop changes significantly as a full stack of applications is offered at the center of enterprise data management, complementing existing infrastructure and leveraging domain expertise from several vendors. The Intel and Cloudera alliance complements this vision, addressing security, storage and more, while leveraging Intel’s footprint in data centers globally to accelerate Hadoop adoption further.

All of these recent developments show how leaders in the Hadoop ecosystem are enabling Hadoop’s role in solving Big Data challenges, something I recently spoke about at the Georgian Partners CTO Conference and in a related interview.

As Doug Cutting predicted, there is no limit to the workloads that can run in Hadoop and Cloudera is certainly pushing the envelope with Cloudera Enterprise 5.

Congratulations to Cloudera for taking such a bold step!


Calling it a holy grail might be an exaggeration, but the idea of an enterprise “hub” or “bus” has been kicking around for a couple of decades now. It can probably be traced to the earliest ideas about enterprise architecture written by John Zachman in the 1980s. But The Open Group’s Len Fehskens goes back further, calling it “. . . another round of realization that things ought to be modularized.”

So what makes a hub a hub? What is different about a Big Data enterprise hub?

About the Hub (Bub)

Ask Cloudera, and they have a simple response: “The foundation of an enterprise data hub is, of course, Apache Hadoop.”

Well, not just Hadoop. To Hadoop, Cloudera added its management tool, Impala for processing SQL over Hadoop, Spark, HBase, extensions for DR and backup, Apache Sentry to improve Hadoop security, and Navigator for metadata management.

Cloudera believes all this value-adding labor adds up to more than the sum of its parts.

Is that sum what an earlier era of enterprise architecture widget-making called the Service Oriented Architecture (SOA) approach?

Enterprise Hubs Likened to Data Freight Services

After all, messaging middleware is hardly unique. Mule, for example, is an existing tool that implements the widely discussed Enterprise Integration patterns catalogued by Gregor Hohpe and Bobby Woolf. Apache ActiveMQ provides Mule with additional messaging glue. In the Apache open source world, others have embraced Camel. IBM’s WebSphere has an ESB, as does Oracle. Even .NET isn’t left out, as Microsoft offers an ESB for Azure.

SOA as a concept seems to be alive and well, if on the downward side of Gartner’s hype curve. A 2012 paper on business service modeling proposed SOA as the basis for business service architecture (Zikie et al., 2012). This was largely the approach taken by a joint Boeing-GMU project undertaken a decade earlier (Sommer et al., 2002).

Swap in JSON for XML if you must, and you have the glue that many cloud services are using today.

Enterprise Service Bus (ESB) was described by Ueno and Tatsubori as simply “a bus which delivers messages from service requesters to service providers.” Is Cloudera up to something different, or does Ecclesiastes 1:9 apply (“. . . there is no new thing under the sun”)?
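Ueno and Tatsubori’s definition can be made concrete with a toy sketch. The class, topic and field names below are invented for illustration; a production ESB adds transformation, routing rules, persistence and transactionality on top of this bare delivery idea:

```python
from collections import defaultdict


class MiniBus:
    """Toy enterprise service bus: requesters publish to a topic,
    and the bus delivers each message to every registered provider."""

    def __init__(self):
        self._providers = defaultdict(list)  # topic -> list of handler functions

    def register(self, topic, handler):
        """A service provider subscribes a handler to a topic."""
        self._providers[topic].append(handler)

    def request(self, topic, message):
        """A service requester sends a message; the bus delivers it
        to each provider and collects the replies."""
        return [handler(message) for handler in self._providers[topic]]


# A provider answering billing lookups (topic and payload are invented).
bus = MiniBus()
bus.register("billing.lookup",
             lambda msg: {"account": msg["account"], "balance": 42.0})

replies = bus.request("billing.lookup", {"account": "A-1001"})
```

The point is only that the bus decouples requester from provider: neither needs to know the other’s identity, only the topic.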

New Things Under the Sun

David Sprott, in “Death of Enterprise Architecture,” proposed a new term: “Smart Ecosystem Architecture.” In his Copernican way of seeing things, the enterprise should not see itself as the center of the universe, creating and consuming its own services in isolation. Instead, each enterprise is part of a larger ecosystem consisting of multiple service producers and consumers. He defines Service Factory as a Service (SFaaS), increasingly delivered “in an MDA/MDD stack.”

David Sprott is betting on a future for Service Factory as a service.

Whether Sprott has this exactly right probably revolves around one’s favored toolbox. But if the stacks deployed by startups are any indication, an ecosystem around Cloudera’s products, supplemented by specialized cloud services like MailChimp or RunMyProcess, can spin up quite a few apps that would take much longer to create using other methods.

Analytics as the Killer App

Gartner’s Philip Allega wrote:

It is clear to Gartner that the role of Enterprise Architecture is not ‘dead.’ It has instead just found its seat at the IT leadership table.

Of course there is no one killer app, as MySpace and WordPerfect demonstrate. But in the cauldron of Big Data software, analytics is as good a choice for a killer app as any. Cloudera believes:

The real benefit comes when you can increasingly bring new analytic workloads to the EDH, reducing the need to invest in and move large volumes of data between platforms just to ask new questions.

Syncsort’s DMX-h joins a circle of Cloudera enablers that includes SAS, Revolution Analytics and Splunk. Together they have weaponized the killer app that is analytics.

Lock and load!


Bo, D., Kun, D., Xiaoyi, Z., Oct. 2008. A high performance enterprise service bus platform for complex event processing. In: Grid and Cooperative Computing, 2008. GCC ’08. Seventh International Conference on. pp. 577-582.

Daigneau, R., Nov. 2011. Service Design Patterns: Fundamental Design Solutions for SOAP/WSDL and RESTful Web Services, 1st Edition. Addison-Wesley Professional.

Hohpe, G., Woolf, B., 2003. Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

Sommer, R. A., Gulledge, T. R., Bailey, D., Mar. 2002. The n-tier hub technology. SIGMOD Rec. 31 (1), 18-23.

Ueno, K., Tatsubori, M., Sep. 2006. Early capacity testing of an enterprise service bus. In: Web Services, 2006. ICWS ’06. International Conference on. pp. 709-716.

Zikie, F. A., Dico, A. S., Debela, D. M., 2012. Business service modeling using SOA: A core component of business architecture. In: Proceedings of the International Conference on Management of Emergent Digital EcoSystems. MEDES ’12. ACM, New York, NY, USA, pp. 181-182.


Mark Underwood writes about knowledge engineering, Big Data security and privacy.

Twitter @knowlengr LinkedIn


Telecommunications has changed a lot since the breakup of the AT&T monopoly. But one thing hasn’t changed: telecommunications remains a capital-intensive business with a big appetite for data. Now it is competitive as well, and its data hunger continues to grow.

Telecom Big Data is Nothing New

Big Data is not new to telecom. Information about calls, in the form of call-detail records (CDRs), has been collected for monitoring and billing purposes for decades. High-speed performance has been a requirement, not a nicety, for telecom. As a result, in-memory databases were first deployed by telecoms in the 1980s.
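For readers unfamiliar with CDRs, a minimal sketch helps fix the idea. The field names below are illustrative only; real carrier formats carry many more fields (cell IDs, trunk groups, rating codes), and rounding rules vary by tariff:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CallDetailRecord:
    """Minimal call-detail record; real CDR formats are far richer."""
    caller: str
    callee: str
    start: datetime
    duration_s: int

    def billable_minutes(self) -> int:
        # Billing commonly rounds partial minutes up (ceiling division).
        return -(-self.duration_s // 60)


cdr = CallDetailRecord("15551230001", "15551239999",
                       datetime(2014, 3, 1, 9, 30), duration_s=125)
print(cdr.billable_minutes())  # 125 seconds rounds up to 3 minutes
```

Multiply a record this small by every call on every switch, retained for years, and the trillion-record scale of the systems described below follows naturally.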

The scale of these collections became clear when the NSA’s MAINWAY database came to light in 2006. According to estimates made at the time, MAINWAY contained 1.9 trillion records. A given carrier probably stored more than this, depending on how much history was maintained. A 2003 listing of large databases included AT&T’s Security Call Analysis and Management Platform (SCAMP). At the time AT&T was keeping two years of call data, which amounted to 26.3 TB, running on AT&T’s Daytona database. A dated report on Daytona, which called it a “data management system, not a database,” indicated that in 2007 Daytona was supporting 2.8 trillion records.

In 2007, it was becoming clear that the 2005 Sprint – Nextel merger was going to be a financial calamity for Sprint. What else has changed in the telco landscape over the past seven years or so?

That was Then. Now Cometh the Petabyte.

As Christina Twiford, a member of T-Mobile’s IBM project team, said in 2012, data volumes grow quickly. The Netezza platform she discussed had to scale from 100 TB to 2 petabytes. The operation she described loaded 17 billion records daily and was supported by 150,000 ETL or ELT processes. Access to this data was provided to 1,300 customers through a web interface.

“Hadoop is exciting in that you can throw in the structured and the unstructured data for advanced analytics. . . And you can process streams of data without ‘landing’ the data. . . Telcos are all over M2M or the Internet of Things,” she said.

Tim Eckard of Zaloni presented some telecom use cases at a 2013 Telecom Analytics conference in Atlanta. One unnamed wireless operator had a Central Data Mediation Archive used for e-discovery, billing disputes and the like. The archive totaled about 1 PB, growing at 12 TB/day in 2012 and doubling to 25 TB/day (5 billion records) in 2013. According to Eckard, the customer was able to move to a Hadoop environment for one quarter the cost of moving to a traditional (relational?) database setting.

Pervasive described similar results with a smaller telco whose Call Descriptor Records were captured with Pervasive’s DataRush. Telco switches dump data every 5 minutes, hundreds of thousands of records per dump, so the telco’s objective was to increase processing speed by an order of magnitude. This was essential to improving decision processes, real-time profit margin measures and operational analysis.

IBM’s Robert Segat likewise suggested that telcos would be using analytics for purchasing and churn propensities. The challenges telcos face, including disruptive competitors, LTE/4G, increased regulation and support for streamed data, are each opportunities for Hadoop and Hadoop-like platforms. In particular, wireless data is expected to grow 50X over the next decade. There are also 30 billion RFID sensors, customer social network information and a greater variety of traffic types, especially video.

What, Me Archive?

According to Hortonworks, its Hadoop distro is being used at seven US carriers. Despite the volume, which BITKOM’s Urbanski indicates in Gigaom is around 10 million messages per second, a telco can save six months of messages for later troubleshooting. This is both an operational and a competitive advantage if it can reduce churn.

Wireless data is expected to grow by 50X over the next decade

Leading edge solutions include using Hadoop data to allocate bandwidth in real time. This opportunity is appealing to telcos because service level agreements may require dynamic adjustments to account for malware, bandwidth spikes and QoS degradation. Such solutions likely involve all facets of Big Data: volume, velocity and variety of data sources.

Forrester’s Brian Hopkins identified big data for real-time analytics as the number 2 “most revolutionary technology” according to a 2013 survey of enterprise architects.

This optimistic picture clearly omits some of the detailed steps needed to fully deploy a telco Hadoop application. Far from being a greenfield application, most telco Big Data deployments will be bigger and faster with more data sources. But how is that data to be loaded into Hadoop?

Expectations for data pipeline speed are high, and ETL toolkits must play nicely with the Hadoop ecosystem. Proponents in telco, and elsewhere for that matter, have high expectations for developer ease of use and execution-time performance. Convenience offerings such as Syncsort’s Mainframe Offload allow Hadoop to coexist with legacy mainframe tools like COBOL copybooks and JCL, without the need for additional software on mainframes. For telcos considering AWS, projects can be rapidly provisioned in EC2 and connected to ETL data sources in RDBMS, mainframe, HDFS, Redshift or S3.
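As a rough illustration of the cleansing an ETL step performs before a Hadoop load, the sketch below parses a hypothetical CSV switch dump and drops malformed rows rather than failing the batch. The field layout is invented, and the actual write to Hadoop (via WebHDFS, Flume, or a commercial tool like Syncsort’s) is deliberately omitted:

```python
import csv
import io


def parse_cdr_dump(raw: str):
    """Parse a switch dump (CSV text) into clean dicts.

    Rows that don't have exactly three fields, or whose duration
    isn't numeric, are skipped: a bad record should not sink a
    multi-million-row batch.
    """
    rows = []
    for rec in csv.reader(io.StringIO(raw)):
        if len(rec) != 3:
            continue  # malformed or blank row
        caller, callee, dur = rec
        if not dur.isdigit():
            continue  # non-numeric duration field
        rows.append({"caller": caller, "callee": callee,
                     "duration_s": int(dur)})
    return rows


# A tiny invented dump: two good records around one garbage line.
dump = "15550001,15550002,61\nBADROW\n15550003,15550004,299\n"
staged = parse_cdr_dump(dump)
# In a real pipeline, `staged` would next be serialized and written
# to HDFS, then picked up by MapReduce, Impala, or Spark jobs.
```

The design choice worth noting is skip-and-log rather than fail-fast: at hundreds of thousands of records per five-minute dump, a pipeline that halts on every bad row never catches up.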

From CDR to CSR

If your phone rings a few seconds after your next dropped wireless call, it may well be a customer service agent with access to real time analytics to aid in troubleshooting the problem.

Assuming you’ve paid your bill. They’ll know that, too.



If a picture is worth a thousand words, is a picture that’s a thousand times bigger better still?

These and other questions are being asked by user experience designers as the tidal wave that is Big Data rolls into the human-computer interaction space.

In recent years, several IPOs reflecting increased demand for business intelligence software demonstrated the need to put data management in the hands of analysts, not only database admins. These IPOs included Qlik Technologies (2010, current market cap $2.4B), Splunk (2012, $8.2B) and Tableau (2013, $4.7B). Earlier generations of BI software provided visualization tools, but Tableau, QlikView and Splunk were part of a crop promising to extend business visualization to a wider audience with fewer intervening steps.

ETL provider Syncsort, Cloudera and Tableau discussed these and other issues in a March 2014 presentation.

Data as the New “Soil”

David McCandless asserts in his 2010 TED talk that “Data is the new soil,” and that visualizations are flowers derived from plants tilled in this ground. He believes some analysts – even untrained ones – have a sort of “dormant design literacy” about visualizations. He says that by ingesting a steady diet of web visualizations, users have unintentionally developed expectations about what they see and how they want to see it. In the dense information jungle, he argues, visualization is a relief, “a clearing.”

Marching a bit out of step with Big Data has been Big Software, which has also seen a rise in visualization tools. Software by Irise, reportedly adopted within 500 of the Fortune 1000, relies on visualization to improve efficiency and transparency in the software development life cycle. Part of what makes Irise attractive to Big Data developers is its gateways to fast-growing repositories like HP Quality Center and IBM Rational.

Also growing from this “soil” are visualizations that bridge data and knowledge. Think Visual Studio on steroids, capable of navigating both data and semantic space. What’s really needed, wrote Bret Victor in 2011, is a systematic approach to interactive visualization, a capability to navigate “up and down the ladder of abstraction.”

Look Up! It’s Cloud Viz.

Most of the familiar names in BI have roots in client/server architectures, though most offer some sort of cloud version by now. Products like Birst Visualizer, ChartIO, RoamBI and ClearStory are newer cloud-based offerings. BigML, aimed at DIY predictive analytics from the cloud, includes a visualization component. Ayasdi, for its part, uses “topological data analysis” to “highlight the underlying geometric shapes of your data,” allowing real-time interaction to produce immediate insights. Microsoft’s Power BI offers data visualization capabilities supporting million-row tables for Office 365 users.

Microsoft Power BI Demonstration (Screenshot)

“Show Me” (Don’t Tell Me)

Think your data challenge is big? Please, to borrow from the Rush song of the same name, show me, but don’t tell me. Big Data visualization is about saying more with less.

Research is underway to explore new ways to visualize big data. Here is one example.

A report by Aditi Majumder and Behzad Sajadi demonstrates how multi-projector displays can deliver visualizations for uses such as immersive simulation or scientific data visualization. They have augmented their displays with gestural interfaces that allow one or more users to interact with these displays in real time, performing tasks such as navigating 3D models. The research also demonstrated a capability to light arbitrary objects so they act as screens for visualizing data, using even everyday objects as “the visualization medium.”

Show Me a Story

There’s more to today’s visualizations than flipboards and hi-res maps. In Business Storytelling for Dummies (Wiley 2014), Karen Dietz and Lori Silverman tell us that visualizations don’t have to tell a story. But, they say, “if you want to use data and data visualization to move people to action, then melding story and data together is essential.” They offer prospective storytellers a few guidelines:


  • Show, don’t tell. Try removing text to see if the visuals still work.
  • Be clear about what you want to accomplish. Dietz and Silverman recommend revisiting the knowledge hierarchy associated with the concepts.
  • Be wary of eye candy. A pleasant-looking visualization doesn’t guarantee that viewers will process the intended message.
  • Slickness and flashiness aren’t necessary. Beware of “. . . thinking that the data is what needs emphasis, when the real key to success is the story about the data” (p. 155).


Visualization for Four V’s

Some aspects of good visualization will remain unchanged from the pre-Big Data buzzword era. So what is new? What will need to change by virtue of increased volume, velocity, variety and veracity?

The impact of Big Data will be felt in unanticipated realms. A recent New York Times story identified Big Data genealogy as leading to potential insights into disease inheritability and untold family connections to major historical events. Author A.J. Jacobs explained:

Last year, Yaniv Erlich, a fellow at the Whitehead Institute at M.I.T., presented preliminary results of his project FamiLinx, which uses Big Data from Geni’s tree to track the distribution of traits. His work has yielded a fascinating picture of human migration.

Even a modest homegrown family tree can be difficult to print. Splicing together multiple pages is the norm for DIY genealogy. A fully interlinked tree that connects to the trees created by hundreds or thousands of others demands a scalability that earlier UX design simply lacked.

Beyond the Visionary

Big Data may not lead to bigger stories. Analysts may discover that some old methods – bar charts, line graphs, heat maps – work as well as ever. But a big dose of any one of the V’s – especially velocity – can change all that. Big Data visualization will have to change accordingly.

But in many domains, animation and 3D may need to become as commonplace as bar charts.

Mark Underwood writes about knowledge engineering, Big Data security and privacy @knowlengr LinkedIn.