Big Data Quality: GIGO Lives On
Data quality was understood to be a major challenge even before computing was widely adopted. The expression Garbage In, Garbage Out (GIGO) first appeared in print in 1963. The expression “GIGO” has since fallen into disuse, but the amplifying effects of Big Data’s Four V’s — Velocity, Volume, Variety, Veracity — are expected to bring data quality roaring back into prominence.
In particular, Big Data system architects should expect to face unanticipated challenges in data quality due to Velocity. Traditional lines of responsibility could once be cleanly drawn between custodians of batch, interactive and real time data systems. “I don’t do real time” was once a perfectly valid excuse for opting out of an information technology initiative. That excuse may not cut it for tomorrow’s Big Data systems.
Still more difficulties lie beyond Big Data Velocity. For instance, MIT’s Elizabeth Bruce says that Variety is considered one of the most difficult challenges. Regardless of which “V” receives the attention, how well understood are the risks to Big Data integrity?
Big Data Quality Risks Lurk for Many Applications
Data Quality #Fail
In a 2012 white paper establishing the need for “information management” products like its own InfoSphere suite, IBM analysts claimed that:
The primary reason that 40% of business initiatives fail is due to poor quality data. Data inconsistencies, lack of completeness, duplicate records, and incorrect business rules often result in inefficiencies, excessive costs, compliance risks and customer satisfaction issues. Therefore improving the quality of your enterprise data will have a huge impact on your business.
Lapses in data quality are mundane rather than glamorous, but they may become a routine finding in failure analysis across many sectors. In hospital settings, for example, increased reliance on digital instrumentation could lead to cascading failures as one system relies upon the outputs from another. Human users of such systems are often unaware of machine tolerances and valid data ranges. Instead, healthcare providers must rely upon system architects to assess data quality and act accordingly, potentially in real time.
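To make the idea of machine tolerances and valid data ranges concrete, here is a minimal sketch of a range check that a system could apply to each incoming reading. The field names and limits are hypothetical, chosen only for illustration. Real limits would come from the device documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValidRange:
    """Inclusive bounds a reading must fall within to be trusted."""
    low: float
    high: float


# Hypothetical limits for illustration only; real values would come from
# the device manufacturer's documentation, not from this article.
VALID_RANGES = {
    "heart_rate_bpm": ValidRange(20.0, 300.0),
    "spo2_percent": ValidRange(50.0, 100.0),
}


def check_reading(field: str, value: Optional[float]) -> str:
    """Classify a reading as 'ok', 'missing', 'unknown_field' or 'out_of_range'."""
    if field not in VALID_RANGES:
        return "unknown_field"
    if value is None:
        return "missing"
    limits = VALID_RANGES[field]
    # A NaN value fails both comparisons, so it is also reported as out of range.
    if not (limits.low <= value <= limits.high):
        return "out_of_range"
    return "ok"
```

A downstream system could refuse to act on anything other than "ok", which is one way to keep a bad reading from cascading into the next system.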
The medical risk scenario is not imaginary, as ThreatPost observed:
Software failures were behind 24 percent of all the medical device recalls in 2011, according to data from the U.S. Food and Drug Administration.
Data Quality Standards
While data quality standards may not yet fully address Big Data, two existing ISO standards offer worthwhile guidance. ISO 8000 is a multi-part standard which includes a focus on master data management. ISO 15926 originally targeted data integration in the process industries, but others may find it useful as well. A particular strength of the ISO 15926 approach is its focus on the lifecycle of a facility. (For related quality efforts, ANSI offers NSSN, a search engine for standards.) The International Association for Information and Data Quality (IAIDQ) is a relatively new (2004) organization which certifies professionals through its Information Quality Certified Professional (IQCP) program.
Solutions: Six Sigma and Other Quality Standards
SDLC Training Gaps
Unfortunately, data quality is lightly taught in software engineering curricula. That it appears on the worry lists of practitioners rather than among the course prerequisites for software engineers is perhaps revealing. If data quality is not seen as part of the Software Development Life Cycle (SDLC), system architects may be forced to integrate verification and resilience after the fact. Earlier attention to quality issues would likely improve quality and also reduce software maintenance costs.
The ABC’s of ETL
Enterprises with fairly mature data warehouses, or at least self-contained islands of business intelligence, are likely to have existing catalogs of ETL (extract, transform, load) rules, often implemented using tools like Syncsort’s DMX, DMX-h or Ironcluster. Close examination of these catalogs can reveal patterns in data quality administration. While it may be counter-intuitive in the heyday of unstructured data (think Hadoop), Big Data is likely to involve a push toward greater data discipline.
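As a rough illustration of what a catalog of data quality rules might look like, the sketch below expresses two common rule types, completeness and duplicate detection, as plain Python functions. It is not how DMX or any other ETL product represents its rules. The names `missing_field`, `duplicate_key` and `run_catalog` are invented for this example.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Rule = Callable[[List[Record]], List[int]]  # returns indexes of offending records


def missing_field(field: str) -> Rule:
    """Completeness rule: flag records where `field` is absent or empty."""
    def rule(records: List[Record]) -> List[int]:
        return [i for i, r in enumerate(records) if r.get(field) in (None, "")]
    return rule


def duplicate_key(field: str) -> Rule:
    """Uniqueness rule: flag each record whose key repeats an earlier record's key."""
    def rule(records: List[Record]) -> List[int]:
        seen = set()
        flagged = []
        for i, r in enumerate(records):
            key = r.get(field)
            if key is None:
                continue  # absent keys are the completeness rule's concern
            if key in seen:
                flagged.append(i)
            seen.add(key)
        return flagged
    return rule


def run_catalog(catalog: Dict[str, Rule], records: List[Record]) -> Dict[str, List[int]]:
    """Apply every named rule and report offending record indexes per rule."""
    return {name: rule(records) for name, rule in catalog.items()}
```

Keeping rules as named entries in one catalog makes it easier to see which checks exist, which are missing, and which fire most often.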
Big Data Quality Cross-Examined
In some organizations, it could make sense to make data quality part of regulatory and compliance responsibilities. Advocates of this method prefer the expression Data Governance. Regardless of the name given to it, every processing stage involves risks of quality lapses: acquisition, calibration, identification and de-identification, aggregation, even data analytics.
A concrete example from medicine makes the point well. Consider instrumentation traffic – “biotelemetry” – from a laboratory device collected in different scenarios:
- Biotelemetry from Patient A to a monitoring application
- Biotelemetry from Patient A admitted a year ago
- Biotelemetry from the same machine, but for Patient B
- De-identified biotelemetry from Patients A and B submitted to a cancer registry
- Biotelemetry from a different model of the same machine for Patient A
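The scenarios above differ mainly in the context that travels with the readings: which patient, which device, which model, and when. One way to picture this is a small context record attached to each batch of readings, plus a check that lists reasons two batches should not be naively compared. Everything here is an assumption made for illustration, including the field names and the 30-day threshold.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ReadingContext:
    """Context that travels with a batch of readings (all fields hypothetical)."""
    patient_id: Optional[str]  # None once the data has been de-identified
    device_serial: str
    device_model: str
    collected_on: date


def comparability_issues(a: ReadingContext, b: ReadingContext) -> List[str]:
    """List the reasons two batches should not be naively compared or merged."""
    issues = []
    if a.patient_id is None or b.patient_id is None:
        issues.append("de-identified: patient cannot be matched")
    elif a.patient_id != b.patient_id:
        issues.append("different patients")
    if a.device_model != b.device_model:
        issues.append("different device models: calibration may differ")
    # Illustrative threshold only; a real limit would be a clinical decision.
    if abs((a.collected_on - b.collected_on).days) > 30:
        issues.append("collected more than 30 days apart")
    return issues
```

An empty list means none of these particular checks objected. It does not mean the two batches are known to be comparable.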
If this example somehow falls short, add a forensic subpoena for laboratory data to the list of scenarios. Suppose Patient A has died and the family has filed a malpractice lawsuit claiming that a data failure directly caused the death. Data quality could well become a topic in a courtroom near you, with a Big Data system asked to take the stand.
Mark Underwood writes about knowledge engineering, Big Data security and privacy.