The Hidden Hand of Data Bias
Editor’s note: This article on data bias, written by Syncsort’s Harald Smith, was originally published on InfoWorld.
Biased data and decisions represent significant risks to your organization, both monetarily and ethically, and may ultimately impact your ability to achieve revenue goals and maintain brand reputation.
On March 27, 2018, amid other recent scandals, the National Fair Housing Alliance and three other organizations filed a lawsuit against Facebook alleging that Facebook’s advertising platform enables landlords and real estate brokers to discriminate against several classes of people, preventing them from fairly receiving relevant housing ads. The outcome and its potential cost are not yet known.
We also do not yet know whether this lawsuit stems from deep issues with data bias, poor (or unethical) business decisions, or both. But organizations looking to increase data literacy among their staff and make data-driven business decisions must raise awareness of data bias and its costs.
What data bias looks like
You may think that information collected by sensors and applications must be free from data bias by definition. After all, billions of data points are being collected in a neutral way. But that isn’t necessarily true. Consider what happened when the city of Boston released a smartphone app to help locate and fix potholes. As reported, there was a hidden bias in the data: smartphone penetration among elderly and low-income populations was only around 16 percent, which significantly skewed reports of issues away from the areas and residents who needed services most. Cases such as tweets during Hurricane Sandy or Google Flu Trends are further examples of how data bias can negatively affect public services.
Consequences of biased data
Discriminatory practices are one consequence of biased data. A recent study, for example, demonstrated skew in who sees job ads. Google’s facial recognition software has also been highlighted as producing significant racial bias. As in the case of the lawsuit against Facebook, the consequences of biased data can be more serious than making the wrong decision or allocating resources improperly: You may be violating the law without knowing it. Sprint and Time Warner have both incurred multi-million-dollar fines from the FTC for such issues.
A brief look at different types of bias
There are too many types of data bias to list here. However, a couple of the most common types of bias are useful to highlight.
- Selection bias: In this case, you are working with a subset of the data instead of a valid sample across the whole population. A data set that only includes male customers in Boston, for example, will not provide insight into the population of potential customers across New England. Similarly, using only tweets to assess a product’s success will not give you insight into the broader population that uses the product but doesn’t tweet.
- Cause-effect bias: We’re trained to look for correlations and patterns. However, relationships that are already embedded in the data may be not only superfluous but may also skew subsequent analysis. I heard a good example at a conference last year relating to data on the Titanic. In that case, the data had highly correlated columns connecting those who survived to those who got onto lifeboats. But neither variable gives any insight into which underlying factors actually improved a passenger’s chance of surviving. Customer data that we regularly use has many such correlated pieces, such as city and postal code.
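One way to spot the kind of redundant columns described in the Titanic example is to measure how strongly two yes/no fields move together before treating either one as an explanation. The sketch below is only an illustration: the rows are made up (not real passenger records), the column names are hypothetical, and the phi coefficient is one of several association measures you could choose. A high value is a prompt to ask whether the two fields are really the same fact recorded twice, not proof of cause.

```python
from collections import Counter
from math import sqrt


def phi_coefficient(xs, ys):
    """Phi coefficient: the Pearson correlation of two 0/1 columns."""
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    counts = Counter(zip(xs, ys))
    n11, n10 = counts[(1, 1)], counts[(1, 0)]
    n01, n00 = counts[(0, 1)], counts[(0, 0)]
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # one column is constant, so no association can be measured
    return (n11 * n00 - n10 * n01) / denom


# Made-up illustrative rows (assumption), one entry per passenger.
boarded_lifeboat     = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
survived             = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
traveled_first_class = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]

# Near-duplicate fields score close to 1 and deserve a closer look.
lifeboat_vs_survived = phi_coefficient(boarded_lifeboat, survived)
# A candidate explanatory field will usually score lower, but is more informative.
class_vs_survived = phi_coefficient(traveled_first_class, survived)

print(f"lifeboat vs survived:    {lifeboat_vs_survived:.2f}")
print(f"first class vs survived: {class_vs_survived:.2f}")
```

The same check applies to pairs like city and postal code, although fields with many distinct values need a different measure than this two-valued one.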
How can you address data bias?
You need to understand and communicate to your teams what bias is and how it may affect their work with data. This understanding is a foundational part of implementing a data literacy strategy and is critical to effective and ethical data-driven business decisions. Lacking it could mean serious consequences for your business. Part of this strategy is to teach people working with data how to use data profiling, data preparation, or BI tools to identify potential areas of bias. For instance:
- When profiling data, think about “completeness,” not just in terms of whether a field is populated, but whether the data set is complete relative to the target population.
- When evaluating a set of codes (age, gender) or dates, look for unexpected skews to the information out of line with broad sources such as census or demographic data.
- When assessing associations between multiple fields, look for pieces of data that represent a similar concept (city and state vs. postal code) or that are highly correlated with one another.
- Ask questions about gaps in data collection. It often helps to understand who gathered the data and what their goals were, particularly if it’s third-party data.
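The completeness and skew checks in the list above can be sketched as a simple comparison between the share of each group in your data set and its share in a reference source such as census data. This is a minimal sketch under stated assumptions: the function name `representation_gaps`, the age bands, the reference shares, and the 10-point tolerance are all hypothetical choices for illustration, not figures from any real study.

```python
from collections import Counter


def representation_gaps(sample_values, reference_shares, tolerance=0.10):
    """Return groups whose share of the sample differs from the reference
    share by more than `tolerance` (absolute difference in proportion)."""
    counts = Counter(sample_values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps


# Hypothetical reference shares, e.g. taken from census data (assumption).
reference = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}

# Hypothetical age bands of the people who submitted reports through an app.
reports = ["18-34"] * 55 + ["35-64"] * 42 + ["65+"] * 3

gaps = representation_gaps(reports, reference)
for group, (observed, expected) in gaps.items():
    print(f"{group}: {observed:.0%} of reports vs {expected:.0%} of population")
```

A flagged group is a question to investigate rather than a verdict: the gap may reflect how the data was collected, as in the pothole app, or a genuine difference in the population you serve.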
These are core data-quality practices that need to be incorporated when helping establish data literacy in your organization. Further, you want to incorporate basic scientific methods. Ensure your teams understand the different types of bias and watch for bias in the questions asked or the goals targeted. Alternative hypotheses need to be raised, evaluated, and presented when testing algorithms and reviewing analyses, before anyone leaps to specific decisions.
Protecting your company from data bias
Biased data and decisions represent significant risks to your organization, both monetarily and ethically, and may ultimately impact your ability to achieve revenue goals and maintain brand reputation. To prevent biased decisions, you must understand what your business goals are, review the data in use, and test different hypotheses. The results need to be incorporated into a review process that assesses biases, risks, and ethical considerations.
This is not something to delegate to an overworked governance or risk and compliance team. It needs to be embedded into, communicated through, and practiced throughout your organizational culture. Incorporating such an approach, and asking these types of questions, may well make the difference in preventing damage to your company, whether from fines, reputational harm, or both.