Model-Making with Big Advertising’s Big Data
In 2013, digital advertising spending passed broadcast television for the first time. IPG Mediabrands' Magna Global estimated that digital advertising will continue to grow by another $20 billion, to $140 billion. Part of the reason for this growth is shifting viewing and listening preferences – especially the trend toward mobile viewing. But another reason may have more to do with measurement and targeting than sheer numbers of views. The claim that digital ads are better because their performance is easier to measure goes back a couple of decades now. But is it true? What does it take to develop predictive analytical models for ad impressions at scale?
Photo Credit: Internet Archive Book Images
For those not directly involved in large-scale model development in the digital advertising era, the scope of the challenges analysts face is not widely understood. Few would disagree that the process is non-trivial, and the advent of Big Data has created additional problems.
More than a Village
At the Association for Computing Machinery’s SIG on Knowledge Discovery and Data Mining (KDD) conference earlier this year, Sergei Izrailev and Jeremy Stanley of Collective Inc. offered detailed insight into how their team builds models at scale, from the model-builders’ perspective. To co-opt the political saying: it takes more than a village to build, test and maintain multiple models in a dynamic environment.
Some of the challenges:
- Hundreds of terabytes may be needed to produce the training data set
- Reach- vs. Performance-based target objectives dictate different approaches
- Hundreds of campaigns are conducted concurrently
- All information is time-dependent
- Multiple data formats
- Multiple data sources
- Subjects with too little data
- Ads may be served in multiple countries in multiple languages
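The time-dependence challenge above is worth a closer look: training features must be computed only from events that happened before a chosen cutoff, or the model "sees the future." A minimal sketch in Python, with a hypothetical event-log layout (the tuple fields are assumptions, not Collective's schema):

```python
from collections import Counter
from datetime import datetime

# Hypothetical event log: (user_id, event_time, event_type).
events = [
    ("u1", datetime(2014, 3, 1), "impression"),
    ("u1", datetime(2014, 3, 5), "impression"),
    ("u2", datetime(2014, 3, 2), "impression"),
]

def impressions_before(events, cutoff):
    """Count impressions per user using only events before `cutoff`,
    so training features never leak information from the future."""
    counts = Counter()
    for user_id, event_time, event_type in events:
        if event_time < cutoff and event_type == "impression":
            counts[user_id] += 1
    return dict(counts)
```

With a cutoff of March 4, the later "u1" impression is excluded even though it is in the log – exactly the discipline the bullet about time-dependent information demands.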
As is often the case in applications where Syncsort ETL comes into play, the data needs to be domesticated before it can be used. For instance, continuous data may need to be discretized using a particular variant of “data binning.” Categorical data, on the other hand, must be transformed into binary indicator features. In both cases, sparse or asymmetric data distributions must be taken into account when the transformations are made. Because these transformations can multiply the number of features, data scientists regularly turn to dimensionality reduction. (A complete discussion is beyond the scope of this brief mention, but dimensionality reduction involves both feature selection and feature extraction.)
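A toy sketch of those two transformations – binning a continuous value and one-hot encoding a categorical one – might look like the following in Python (the bin edges and vocabulary are illustrative assumptions):

```python
def bin_continuous(value, edges):
    """Map a continuous value to a one-hot vector over the bins
    defined by `edges` (a simple form of data binning)."""
    vec = [0] * (len(edges) + 1)
    i = sum(value >= e for e in edges)  # index of the bin the value falls in
    vec[i] = 1
    return vec

def one_hot(category, vocabulary):
    """Encode a categorical value as binary indicators; unseen
    categories map to the all-zeros vector (sparse by design)."""
    return [1 if category == v else 0 for v in vocabulary]
```

Note how both encodings inflate one column into many – with hundreds of raw attributes, this is precisely how the feature count explodes and dimensionality reduction becomes necessary.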
For example, a large portion of users receive only a single ad impression because they block or delete cookies. Other users may return to engage further with the site, but beyond what is estimated to be a reasonable window after the initial engagement. In both cases, model builders must judge which data to include and which to filter out, on either statistical or computational grounds.
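Both filters can be sketched in a few lines of Python; the threshold of two impressions and the 30-day window are illustrative assumptions, not figures from Collective:

```python
from datetime import datetime

def drop_sparse_users(impressions_by_user, min_impressions=2):
    """Filter out users with too few impressions (e.g. cookie
    blockers); the threshold of 2 is purely illustrative."""
    return {u: n for u, n in impressions_by_user.items()
            if n >= min_impressions}

def within_window(impression_time, engagement_time, max_days=30):
    """Keep an engagement only if it falls within a reasonable window
    after the impression; 30 days is an illustrative choice."""
    delta = (engagement_time - impression_time).days
    return 0 <= delta <= max_days
```

In practice the judgment call is in choosing those constants, which is where the statistical and computational trade-offs mentioned above come in.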
What’s In the Toolbox
The choice of modeling algorithm limited the platform choices Collective considered. Because they had chosen an R package, glmnet, the models themselves had to be built in R. Collective explained their tooling selection process this way:
Constructing the data sets for modeling is very intensive in data processing and requires sufficient capacity for the system. Virtually all of our data is structured and many relevant data transformations involve joins across data sets. These and other considerations led us to build the data-centric portion of our modeling platform on an IBM PureData (formerly, Netezza TwinFin) appliance. At its core is a parallel relational database with SQL interface, capable of storing and processing tables with hundreds of billions of rows. Since we needed to use both R and SQL within the system, we chose to standardize around these two languages.
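For readers unfamiliar with glmnet: it fits generalized linear models with an elastic-net penalty, a mix of L1 (sparsity-inducing) and L2 (shrinkage) regularization. The toy Python gradient-descent sketch below illustrates the objective only; glmnet itself uses a far more efficient coordinate-descent algorithm in R, and every constant here is an illustrative assumption:

```python
import math

def fit_elastic_net_logreg(X, y, alpha=0.1, l1_ratio=0.5,
                           lr=0.1, epochs=200):
    """Toy elastic-net-penalized logistic regression. The penalty mixes
    L1 and L2 terms via `l1_ratio`, mirroring the kind of model glmnet
    fits (glmnet uses coordinate descent; this sketch is illustrative)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted click probability
            for j in range(d):
                grad[j] += (p - yi) * xi[j]
        for j in range(d):
            sign = 1 if w[j] > 0 else -1 if w[j] < 0 else 0
            # elastic net: l1_ratio * L1 subgradient + (1 - l1_ratio) * L2
            penalty = alpha * (l1_ratio * sign + (1 - l1_ratio) * w[j])
            w[j] -= lr * (grad[j] / n + penalty)
    return w
```

The L1 component is what drives irrelevant features' weights to exactly zero – valuable when binning and one-hot encoding have produced a very wide, sparse design matrix.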
Shout-out To Mahout
Those who wish to experiment in machine learning may consider using an open source tool. An oft-mentioned alternative is Apache Mahout. While Mahout was once closely linked to MapReduce, its developer community has recently embraced the use of a domain-specific language for linear algebraic operations, which produces code that can run in parallel on Apache Spark. The team suggests three apt use cases for Mahout: recommendation mining, clustering, and classification. Other alternatives include Weka and GraphLab.
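To make one of those use cases concrete, here is a minimal in-memory k-means clustering sketch in Python – not Mahout code, just an illustration of the algorithm Mahout distributes across a cluster. Centers are seeded from the first k points for determinism; real implementations would use random or k-means++ initialization:

```python
def kmeans(points, k, iters=20):
    """Minimal k-means clustering, one of the three Mahout use cases,
    run in-memory; Mahout's value lies in distributing this over
    Hadoop or Spark for data that won't fit on one machine."""
    centers = [points[i] for i in range(k)]  # deterministic seeding
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # recompute each center as its cluster's mean
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers
```

In an advertising context, the points might be per-user feature vectors, with the resulting clusters serving as audience segments.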
While complaints about speed and ease of use have slowed adoption of machine learning applications, the need for data standardization and cleansing across increasingly diverse sources – beyond clicks and cookies to sensors and the Internet of Things – remains clear. The use of ETL to pull in clickstream data is just part of the story. Data from LiveRamp and Experian repositories could further enrich predictive models – and add still more complexity.
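The enrichment step described above is, at its core, a left join of first-party clickstream rows against third-party audience attributes. A toy Python stand-in (the field names and segment labels are hypothetical; real pipelines key on hashed identifiers supplied by an onboarding provider):

```python
# Hypothetical rows; real pipelines would join on hashed identifiers
# rather than raw user keys.
clickstream = [
    {"user_key": "abc", "page": "/sports"},
    {"user_key": "xyz", "page": "/finance"},
]
third_party_segments = {"abc": "auto_intender"}

def enrich(rows, segments):
    """Left-join clickstream rows with third-party audience segments;
    unmatched users keep a None segment. A toy stand-in for the ETL
    enrichment step described above."""
    return [{**row, "segment": segments.get(row["user_key"])}
            for row in rows]
```

The unmatched-row case is where the added complexity shows up: every downstream model must now handle the missing-segment value explicitly.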
Anyone inclined to underestimate the power of machine learning should watch a rerun of IBM Watson in its fateful Jeopardy match.