In January 2013, you hosted Mike Lamble’s story in which he wrote: “A growing portion of advanced big data practitioners are finding that Hadoop tools are marginal for big data integration and analytics of structured data. Relative to expectations of the business intelligence community, Hadoop query times are slow, its access methods are arcane, it isolates data subject areas from each other, and it lacks a rich third-party ecosystem of tools for analysis, reporting, and presentation.” A year and a half later, what’s your current assessment?
The hype surrounding Hadoop is out of control, and there have been a lot of expectations about the technology that will never be met. Still, Hadoop is a viable technology for managing and analyzing very large data sets. It is not a database management system, though, and it is not for real-time querying. Much of the mismatch between the expectations of the business intelligence community and the reality of what Hadoop delivers can be traced to misunderstanding and over-marketing, in my opinion. Slowly we are seeing the Hadoop ecosystem becoming more robust. Tools like Pig and Hive are becoming more mainstream, albeit still pre-version-1 offerings. YARN, aka MapReduce 2.0, bolsters the resource and job management capabilities of Hadoop. Spark makes it easier to write applications for large-scale data processing, combining SQL, streaming and complex analytics. And we are seeing many companies building out their Hadoop offerings with more usable capabilities (e.g., Hortonworks and IBM). While things are not nearly as usable and manageable as in the relational DBMS community, the management and usability of Hadoop are, indeed, getting better.
You have written at TheDatabaseSite about the importance of metadata for big data. Do you think this is an area that still needs further technology investment, or is the lack mainly one of practitioner attitude?
I think the issue is primarily one of short-term thinking. But there is probably some greed and ignorance thrown in there, too. There are a good number of software solutions that aid in the discovery, cataloging and definition of metadata. Of course, newer, better, cheaper, more functional solutions are always welcome, but the lack of technology is not the problem. The bottom line is this: Organizations need to define and classify data in order to ensure compliance with the regulations that impact their industry, and therefore, their data – as well as to exploit the big data at their disposal and turn it into actionable business intelligence. Without accurate and up-to-date metadata, there is no hope of assuring that your company is in compliance with all of the laws that impact your data – no hope at all. And the prospects for taking advantage of big data dim significantly.
Metadata is required in order to use any application. It is impossible to correctly use computer software and make accurate decisions if you do not understand the data with which you are interacting. Metadata makes understanding possible. It should stand to reason that you would want to define and understand all of the data that your organization uses and cares about, right? If you don’t want to define and categorize the data appropriately, then why even collect, manage, and process it? And with regulatory compliance requirements, how can you verify that you are in compliance if you have not accurately defined your data elements? So why isn’t metadata more embraced and adopted within modern organizations? Getting along with undefined data, bad data quality and other data problems has been de rigueur for a long time now. Solving the problem will not be quick and easy; it too will take time. And money. But what executive is willing to spend to solve a problem like this where it is difficult to quantify the ROI, there is little in the way of policing (to ensure compliance) and organizations have been muddling along in the dark for years? I think we all know the answer to that question. With big data, the lack of metadata will become even more problematic. Mo’ data, mo’ problems.
You ended the commentary about metadata by advising organizations to invest in programs to improve metadata. Is there a role for Master Data Management in this process? If not, what role do you see for MDM in big data?
Of course MDM has a role in the process of improving your metadata. The devil is in the details. Inasmuch as MDM uses processes, governance, policies, standards and tools to consistently define and manage an organization’s critical data, MDM is itself critical. But what does critical mean? Well, different things to different organizations, as it should. With an MDM program in place, organizations will have thought about that question – which data is critical – and applied processes and rules to improve the quality of critical data. To advance their big data agenda, then, they can build upon the start they’ve achieved via MDM and gain an advantage over competitors who have not tackled MDM. Now I would not necessarily say that all of the rigor deployed in the MDM program need be applied to all of the big data being brought into the organization for analysis. The volume and velocity associated with big data make that level of management and governance harder to achieve. But with MDM in place, your organization stands a better chance of bringing in the right type of data that can mix and match up with existing data, thereby improving the odds of capturing business insight in the advanced analytics run against the big data.
You offered some guidance for prospective DBAs, which included a stern warning about long hours. Point taken. How might big data change things for the DBAs of this world?
How might life change for DBAs? That’s a loaded question. Life is always changing for DBAs! The DBA is at the center of new application development and therefore is always learning new technologies – not always database-related technologies. Big data will have a similar impact. DBAs should be learning NoSQL DBMS technologies, but not with an eye toward replacing relational. Instead, they should learn them because, at least for the time being, NoSQL technologies (key/value, column, document store and graph) are common in big data and advanced analytics projects. My view is that these technologies will remain niche solutions and that relational DBMSs will add functionality to combat the NoSQL offerings (like they did to combat the Object-Oriented DBMS offerings in the 1990s). Nevertheless, DBAs should learn what the NoSQL database technologies do today, both so that they can help to implement projects where organizations are using NoSQL and so that they can get ahead of the functionality that will be (is being) added to the Big Three RDBMS products (DB2, SQL Server and Oracle).
DBAs should also learn about Hadoop. Now Hadoop is not a DBMS, but it is likely to be a long-term mainstay for data management, particularly for managing big data. Hadoop education will bolster a DBA’s career and make them more employable long term. It would also be a good idea to read up on analytics and data science. Although most DBAs will not become data scientists, some of their bigger users may be. And learning what your users do – and want to do with the data – will make for a better DBA. Finally, I would urge DBAs to automate as many data management tasks as possible. The more automated existing management tasks become, the more available DBAs become to learn about and work on the newer, more sexy projects.
Syncsort is deeply involved in the ETL process and the general data pipeline into HDFS repositories. Which discipline within an organization is responsible for ETL/ELT and the rules surrounding it? Is this a purely IT function, or should it be moved into the data owners’ domains – e.g., marketing or logistics?
As with most questions of this type, the answer is not an either/or – yes/no – type of answer. I think you need to split things up into the business requirements and the IT implementation. From a business requirements perspective, the subject matter experts (SMEs) must define the latency requirements that are acceptable, as well as setting the budget for the data movement. These things need to match. For example, you can’t request zero data latency with a small to non-existent budget. Wait, I guess you can request that, but you won’t get it! This is where IT must participate.
Understanding what tools are available, how they work, what they cost, and how they can integrate with the systems and databases in use falls within the domain of IT. And these things impact the cost. IT also is generally tasked with setting up the technical aspects of the project: installing the software, configuration, aligning with existing technology, etc. Building the ETL scripts and tasks can also be a shared duty between IT and the SMEs. Depending on the complexity of the solution and its interface, the business SMEs may be able to create and drive the extraction, transformation and movement tasks. But this really can only be done after the IT folks set up the software with an understanding of the environment so that, for example, the business user does not request a long-running, resource-intensive extraction to run in the middle of the busiest period of the day.
If, as you write, big data is more of an “industry meme” than a technical term to be agonized over, should technology journalists and internal evangelists downplay the terminology in favor of more widely recognized (i.e., more mature) expressions, such as VLDB, distributed computing, grid computing, or just cloud computing?
The short answer to your question is “No.” The longer answer takes some explaining, so here goes. Big data is undoubtedly a term that is being used to hype certain types of products. And the industry analyst firms have come up with their definitions of what it means to be processing “big data,” the most famous of which talks about “V”s. As interesting as these definitions may be, and as much discussion as they create, the definitions don’t really help to define what benefit organizations can glean from big data.
So, with that in mind, and if we wanted to be more precise, it would probably make sense to talk about advanced analytics instead of big data. Really, the analytics is the motivating factor for big data. We don’t just store or access a bunch of data because we can; we do it to learn something that will give us a business advantage. That is what analytics is: discovering nuggets of reality in mounds and mounds of data. But I am not in favor of that (I’ll tell you why in a moment). The driving factors for the growth in big data analytics are mostly focused around providing insight into business practices and conditions that can deliver a competitive advantage. By analyzing large amounts of data that were previously unavailable – or were difficult to impossible to analyze in a cost-effective manner – businesses can uncover hidden opportunities. Better customer service, increased revenue and improved business practices are the goals that drive big data analytics programs.
So why wouldn’t I endorse talking about analytics instead of big data? Well, more than half the battle is getting the attention of decision-makers, and the term big data has that attention in most organizations. As a data proponent, I think that the data-focused professionals within companies today should be trying to tie all of the data management and exploitation technologies to the big data meme in order to get funding. That way, we can better manage the data (small, medium and big) that we are called upon to manage!
Will big data lead to bigger IT budgets? Or is this shift in the value proposition simply going to stimulate some realignment within existing IT budget categories?
I think the days of burgeoning IT budgets are likely over, and that as big data spending flows, other spending is likely to ebb. Gartner’s forecast for IT spending growth (as of 2Q2014) stands at 2.1%. IDC forecasts that the market for big data technology and services will grow from $6 billion in 2011 to $23.8 billion in 2016. Clearly, the money being spent on big data is not new money or Gartner’s 2014 IT spending growth forecast would be larger. Or perhaps both analyst firms are “off” in their predictions. I can see, perhaps, too, some increase in the budget to advance an organization’s big data program. Forward-thinking organizations that understand the value of wisdom in large amounts of data, if only they take the time to uncover that wisdom somehow, will probably be inclined to spend a bit more to try to win that gamble. But it is a gamble, so corporations (conservative by nature) will, I think, in most cases, shift money around instead of spending a lot more. Now guessing which categories will suffer at the expense of the big data spending spree is another matter altogether.
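To put the two forecasts quoted above on the same footing, the IDC figures can be converted into a compound annual growth rate and set beside the Gartner number. What follows is only a rough sketch: it assumes simple annual compounding over the five years from 2011 to 2016, and the function and variable names are illustrative rather than anything from either analyst firm.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# IDC figures quoted in the answer: $6 billion (2011) to $23.8 billion (2016).
big_data_growth = cagr(6.0, 23.8, 2016 - 2011)

# Gartner's overall IT spending growth forecast quoted in the answer (2.1%).
overall_it_growth = 0.021

print(f"Big data market, implied annual growth: {big_data_growth:.1%}")
print(f"Overall IT spending growth forecast:    {overall_it_growth:.1%}")
```

Under those assumptions the big data market grows at roughly 32% a year against about 2% for IT spending overall, which is the gap behind the argument that big data money is being shifted from elsewhere rather than added.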
Does big data, considered on its own, imply a larger need for beefier resources in risk management, privacy and security? Or will organizations expect security firms to do the realigning for them with no additional expense?
I try to avoid predicting things, but I am happy to speak about what should happen. Big data does indeed imply the need to implement improved techniques for ensuring and managing privacy. In my experience, customers are becoming increasingly suspicious of big companies in terms of what data is being collected and how that company secures and protects the data. Most organizations could use improved techniques and tools for protecting “little/medium” data, let alone big data. For risk management, organizations must be able to quantify the business value of the big data. And categorize exposure and loss of data in terms of the reduction in value, impact to the company’s reputation, loss of potential trade secrets, etc., etc. So, yes, it stands to reason that when you start performing a new type of data processing (e.g., big data analytics), a company should examine the risk, governance, privacy and security aspects of that new technique and/or application. And that would come with a higher expense. All that said, the cynic in me says that most organizations will attempt to tackle big data without fully funding, or perhaps even analyzing, these aspects.
Most DBAs have cut their teeth supporting client server applications using an RDBMS (SAP + Oracle DB, SAP + DB2, Microsoft Dynamics + SQL Server, etc.). If big data trends toward higher velocities result in a move closer to real-time event processing, what needs to change in design and maintenance practices for DBAs?
First of all, I’d augment your statement to include homegrown applications instead of just the COTS applications you mention. And some DBAs have focused on BI and data warehousing applications instead of the predominantly transactional applications you cite. I bring these things up to emphasize that DBAs already support multiple and various requirements. That means that DBAs are accustomed to supporting many different requirements and learning new techniques. This will serve them well as they find their way into supporting big data and advanced analytics projects.
The second aspect of your question is that higher velocities will result in a move to real-time event processing. And yes, in some cases, that will be the goal of a big data project. For example, fraud detection benefits from being as close to real time as possible, in order to perhaps become fraud prevention! And real-time ingestion (or processing) of large data streams is another aspect of big data projects that is likely to be a challenge for DBAs. But also keep in mind that big data can power predictive analytics. This does not require the real-time aspect so much. By analyzing reams of data and uncovering patterns, intelligent algorithms can make reasonably solid predictions about what will occur in the future. This requires being adept enough to uncover the patterns before changes occur, but not necessarily in real time.
But to circle back and fully answer your question, DBAs need to understand the shortcomings of existing practices and techniques for certain types of analytical processing. They will need to keep their existing skills, because relational databases are NOT going anywhere, nor are OLTP applications. But they will need to augment those skills with different design methods, some simpler (key/value) and maybe some more complex. They may need to forgo some control over the schema in order to allow more flexibility (as with document databases) for certain types of big data implementations. The reality is that things are changing to support big data, and DBAs will need to understand what that means before they can know which new skills they will need to learn.
I was reminded by your reposted message about a DBA code of ethics – or the lack of one – that DBAs are a significant insider threat in most organizations. Are there best practices to address this, or does the issue fall under a contemplated DBA code of ethics?
Currently, there are no best practices that I am aware of for a DBA code of ethics. I think that a code of ethics would be a good idea for most management disciplines, DBA included. But even with a code in place, that does not solve the problem. Human nature being what it is, we’ll see folks who’ve signed a code (or agreement) who still violate it. The other aspect of this is that in my experience DBAs are, for the most part, trustworthy and want to do a good job in terms of managing and protecting their company’s data. This is the case even without a DBA code of ethics. Oh, there are always exceptions (see http://www.computerworld.com/s/article/298312/Rogue_DBA_Steals_Sells_Personal_Info), but most DBAs have significant IT experience and worked their way into a trusted position as a DBA. Even so, creating a DBA code of ethics – and having DBAs sign it – would help to raise the awareness of the potential threat. And that would be a good thing. For the most part, SIEM and database auditing techniques and software are used by progressive organizations to perform privileged user (e.g., DBA) monitoring. One of the long-standing issues with such approaches is the amount of system resources that auditing can consume. But if you tackle the task appropriately, by pinpointing who and what needs to be audited – and possibly using advanced software to minimize the overhead – then organizations can monitor this type of insider threat quite well. Now that does not mean that most organizations actually DO this type of monitoring, only that it is possible and can be done.
After the Neiman Marcus and Target breaches, Jaikumar Vijayan questioned in Computerworld whether PCI compliance means anything. Are current PCI standards adequate to address big data scale and scope, or is the problem with the implementers?
PCI, as with any standard, must be a constantly evolving standard. It takes training, effort and planning to keep up with the bad guys out there. A standard that works today fails tomorrow because a skilled hacker figures out a new way to exploit a piece of hardware or software. Pointing fingers when problems occur is not the best approach. And selling any standard as a fail-safe means of protecting all of your data will never work. So I think that PCI compliance DOES mean something, but maybe not what some folks think it means. It should mean that your organization has taken best practice steps to avoid intrusions and data theft – that is, reasonable, up-to-date measures as determined by people knowledgeable in the area. It does not mean that your organization will not experience a problem!
To ensure that PCI continues as a valid standard, a continuing group of industry consultants and IT professionals should regularly be working to augment the standard based on the latest technology improvements and security breakdowns. And a schedule should be produced for the publication of revised standards on a regular basis – yearly, or maybe every other year makes sense to me. And funding a body with the ability to conduct compliance audits is another component of ensuring the success of a regulatory standard. It may be the same body that produces the standard, or better yet, a separate body (because creating the standard and policing it require different skill sets). Industry standards (like PCI DSS) can police compliance better than governmental standards (like SOX) because they are more adept at imposing penalties (e.g., you cannot accept card payments) and funding the oversight process.
What standards – de facto or emerging – are you most keenly following?
Well, the best thing about standards is that there are so many to choose from! I am reminded of the story where two well-meaning IT folks were complaining about the multiple different standards there were for their specific area. One of the two says, “There are 14 different standards that we could apply to solve this problem in 14 different ways! What should we do?” And the other one says, “Let’s get together with some of our peers and find a way to synthesize and consolidate all 14 of these into a cohesive, integrated standard!” And so they did, and 18 months later there were 15 different standards to choose from!
Anyway, even with my skepticism about standards, there are a couple that I care about. Of course, the SQL standard is one that I always keep an eye on, but things are not very active on that front right now. I do believe, though, that as the NoSQL functionality gets adopted in the major relational systems, SQL functionality will grow and thereby impact the SQL standard. The world of RDF and the semantic web with SPARQL is interesting. And it is also worth following the Hadoop and MapReduce world to see what standards will emerge. Things are still kinda like the Wild West out there right now, though.
Finally, I’d like to mention that I think an area that could benefit from standardization is database administration. It is true that the exact details of how databases are managed depend significantly on the DBMS being used and the features and functionality being supported. Nevertheless, the management discipline of database administration could benefit from a standards-based, best-practice kind of approach. For example, a standardized approach to backup/recovery with RTOs (recovery time objectives) driving the process; or more standardized performance management approaches with SLAs (service level agreements) driving the process. Today, a lot of DBA work is done based on putting out fires and reacting to who complains the loudest instead of managing to stated objectives and requirements.
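As one illustration of what letting an RTO drive the backup/recovery process could look like, the sketch below estimates how long each candidate recovery plan would take and checks it against a stated objective. Everything in it is hypothetical: the class and function names, the sizes and throughput figures, and the simplifying assumption that recovery time is just restore time plus log-apply time. It is not drawn from any DBMS or standard.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    """Hypothetical description of one backup/recovery approach."""
    name: str
    backup_size_gb: float       # size of the full backup to restore
    restore_gb_per_hour: float  # measured restore throughput
    log_apply_hours: float      # time to apply logs after the restore

    def estimated_recovery_hours(self) -> float:
        # Simplifying assumption: restore time plus log-apply time.
        return self.backup_size_gb / self.restore_gb_per_hour + self.log_apply_hours


def meets_rto(plan: RecoveryPlan, rto_hours: float) -> bool:
    """True when the plan's estimated recovery time fits within the RTO."""
    return plan.estimated_recovery_hours() <= rto_hours


# Illustrative numbers only.
plans = [
    RecoveryPlan("weekly full + logs", 2000, 500, 3.0),
    RecoveryPlan("nightly full + logs", 2000, 500, 0.5),
]
rto_hours = 6.0

results = {plan.name: meets_rto(plan, rto_hours) for plan in plans}
for plan in plans:
    verdict = "meets" if results[plan.name] else "misses"
    print(f"{plan.name}: {plan.estimated_recovery_hours():.1f} h, {verdict} the {rto_hours:.0f} h RTO")
```

With these made-up figures the weekly plan needs about 7 hours and misses a 6-hour objective, while the nightly plan needs about 4.5 hours and meets it. The point is only that the stated objective, not the loudest complaint, decides which plan is acceptable.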