
Hadoop for Audit

Somewhere in the windowless, unglamorous temp space reserved for auditors, there should have been quiet celebration. Syncsort had just announced Ironstream, its tool for moving massive mainframe log files to Splunk. It was now possible to ingest SMF, Log4J, SYSLOG and other high-volume mainframe records into Splunk Enterprise. Even without Splunk, organizations already had the option of using Syncsort DMX-h, an ETL tool, to ingest high-volume audit and ops data into Hadoop.

But many fear the Big Data story has yet to reach those sequestered cubicles.


Big Data for audit may represent a broad step forward in GRC.

Big Four Big Data

It’s no accident that professional services firms originally engaged only for audit duties, such as tax or public accounting functions, expanded into other services within customer organizations. A long series of mergers and acquisitions has left four very large multinational firms that are responsible for auditing almost all publicly traded companies, and many private ones, too. Taken together, these four firms, Deloitte, PwC (PricewaterhouseCoopers), EY (Ernst & Young) and KPMG, account for almost three quarters of a million professionals worldwide.

The work of these many thousands of professionals is usually distilled to a single image, a single activity: the financial audit. A goal of financial audits is to assess the validity and reliability of information provided by a company. In addition, auditors are responsible for studying a company’s internal controls. Of course, the fine print matters: audit findings are only as valid as the evidence provided by the company and the systems used to manage that evidence.

Outside of public accounting circles, the other work performed by audit firms is less well understood, but is at least as important as financial auditing. The realm of Governance, Regulation and Compliance (GRC) is one where Big Data can change what goes on in those windowless offices.

GRC Agility

For well-understood regulations and standards like Sarbanes-Oxley, guidelines from the Public Company Accounting Oversight Board, and the PCI Data Security Standard (“PCI DSS”), GRC has many dimensions.

Governance can include adherence to accounting standards across multi-national operations. Compliance can involve adherence to national and local taxation, employment and reporting. Regulation can involve industry-specific reporting that is mandated by national, state or local agencies, or mandated by court orders as a result of civil actions.

The audit challenge grows in direct proportion to the use of Big Data for transactional data in ecommerce. The recently described American Express Big Sync Platform began as separate data warehouses based on relational databases. Big Sync is now a single Big Data system that includes a recommender system to “make deals on the fly that merchants want consumers to take advantage of.” Big Sync is big: it consists of 17 server nodes, has around 300 cores and stores almost 1 petabyte of data per rack. Suppose the audit task is to judge the validity of revenue the company attributes to Big Sync-initiated transactions. At the least, auditors might need to sample transactions from Big Sync. To do so, they might need access to Apache Solr, the enterprise search tool used in Big Sync, or might need to connect Big Sync to Tableau or QlikView for data visualization.
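To make the sampling step concrete, here is a minimal Python sketch of reservoir sampling, one common way to draw a uniform random sample from a record stream too large to hold in memory. It is an illustration only, not a description of how Big Sync or any audit firm works: the `fake_transactions` generator, the field names and the sample size are all hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k records chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # fixed seed so a reviewer can reproduce the sample
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def fake_transactions(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for transactions read from a Big Data store."""
    for i in range(n):
        yield {"txn_id": i, "amount_cents": 100 + (i * 37) % 5000}


audit_sample = reservoir_sample(fake_transactions(100_000), k=25, seed=42)
```

Because the stream is read only once and only `k` records are held at a time, the same approach could in principle be applied to an export from a search index or a Hadoop job. Recording the seed with the workpapers lets another auditor regenerate the identical sample.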

Sleuth vs. Sleuth

Some facets of audit involve discovering anomalies and searching for patterns in logs. While the use of Hadoop and similar tools for network security is much discussed (e.g., recent startup vArmour), Big Data auditors are likely to look beyond the increasingly well-understood area of cybersecurity to areas closer to home.

For example, in 2011 J. Perols identified financial statement fraud detection as a useful application of statistical and machine learning models. This approach can be extended using tools like Apache Mahout to operate on data stored in Hadoop or other Big Data repositories.
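As a rough illustration of what a statistical screen can look like, the Python sketch below flags entities whose financial ratio sits far from the group median, using a modified z-score based on the median absolute deviation. This is a deliberately simple stand-in, not one of the models Perols evaluates and not Mahout code. The business-unit names, the receivables-to-revenue ratios and the 3.5 threshold are assumptions chosen for the example.

```python
import statistics
from typing import Dict, List


def flag_outliers(values: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return keys whose modified z-score (median/MAD based) exceeds the threshold."""
    median = statistics.median(values.values())
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = statistics.median(abs(v - median) for v in values.values())
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged
    return [
        key
        for key, v in values.items()
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Hypothetical receivables-to-revenue ratios per business unit.
ratios = {
    "unit_a": 0.21,
    "unit_b": 0.19,
    "unit_c": 0.22,
    "unit_d": 0.20,
    "unit_e": 0.58,
    "unit_f": 0.18,
}

flagged = flag_outliers(ratios)  # only unit_e stands out in this made-up data
```

A flag from a screen like this is a prompt for further audit work, not evidence of fraud. In practice such a screen would be one input among many, and a trained model would be validated against known cases.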

Despite considerable training within the Big Four in tools like Tableau and Splunk, audit professionals will typically need to partner with information technologists to advance Big Data audit projects for customers. The key takeaway for auditors is this: Hadoop and other unstructured data repositories represent a cost-efficient way to collect data now, at Big Data scale. This data can be fed immediately to machine learning systems, or collected for later analysis.

One example of the trend is evident in the Financial Industry Regulatory Authority (FINRA) proposal to run the SEC Consolidated Audit Trail, which is designed to help regulate markets in the world of high-frequency trading. The proposed platform incorporates Hadoop, Amazon Web Services, Hortonworks and Cloudera components to enable Big Data analytics to support market surveillance.

Such tools may well be needed to stay one step ahead of the misdeeds of bad actors. Some of those bad actors may well be insiders like Jérôme Kerviel, responsible for a $7B loss at the French bank Société Générale. Audit firms that fail to adopt such tools could lose audit business to competitors, as well as to forensics specialists.


Perols, J. (2011). Financial Statement Fraud Detection: An Analysis of Statistical and Machine Learning Algorithms. AUDITING: A Journal of Practice & Theory, May 2011, Vol. 30, No. 2, pp. 19-50.

