Algorithmic Bias in Software Design and AI

By Vivi Hoang
Dec. 10, 2021

Software, as a product of human design, by its very nature has our human worldviews encoded into it, whether intentionally or unintentionally. Artificial intelligence (AI) amplifies this tendency, particularly in the AI subfield of machine learning, in which algorithms evolve and adapt based on insights gleaned from large datasets.

By now, this spectrum of automation pervades our everyday lives. But the tech industry gravitates toward releasing products as quickly as possible, typically without thinking through possible ethical and societal ramifications.1

One resulting risk is the automation of bias: As a consequence of the widespread use of algorithms, bias has been shown to manifest in countless areas with real-world consequences, including criminal justice, education, employment, healthcare, housing and certainly social media.

Many of these industries incorporate these tools “with the promise that algorithmic decisions are less biased than their human counterpart,” writes Princeton sociologist and Ida B. Wells Just Data Lab founder Ruha Benjamin. “But human decisions comprise the data and shape the design of algorithms, now hidden by the promise of neutrality and with the power to unjustly discriminate at a much larger scale than biased individuals.”2

This paper examines contributing factors to algorithmic bias; considerations to take during the software design process; mitigation methods; and recent examples of algorithmic bias and their impact in a variety of sectors.

An academic example

Algorithmic bias arises when issues with a computer system, including errors and false assumptions, lead to unfair outcomes that favor one arbitrary group over another.3

“Algorithms are part of existing (biased) institutions and structures, but they may also amplify or introduce bias as they favor those phenomena and aspects of human behavior that are easily quantifiable over those which are hard or even impossible to measure,” wrote Ntoutsi, et al., in a 2020 paper.4

For instance, from 2013 to 2019, the University of Texas at Austin’s computer science department relied on a machine-learning system to help winnow down PhD applicants. A faculty member and a graduate student from the department developed the system, called GRADE (GRaduate ADmissions Evaluator), to reduce the amount of time required of reviewers.

In that, the system succeeded, cutting reviewers’ time commitment by 74%.5 In particular, it freed up reviewers to devote more time to candidates on the borderline of acceptance or rejection, especially as the number of applications grew from about 250 in 2000 to more than 1,200 after 2012.6

For each applicant, the system provided reviewers with a numerical score out of five along with an explanation of factors supporting that score. For example, applicants with higher GPAs and from prestigious undergraduate alma maters scored higher. Letters of recommendation containing keywords like “best,” “award” and “research” scored higher than those with “good,” “programming” or “technology.”7
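
The published description suggests a model that weights applicant features learned from past admissions decisions. As a rough illustration only, not a reconstruction of GRADE itself, a hand-weighted scorer over similar features might look like the sketch below; the weights, keyword handling and clipping to a five-point scale are all assumptions.

```python
# Hypothetical sketch of a feature-based applicant scorer in the spirit of
# GRADE. The weights, keyword lists and five-point clipping are invented for
# illustration; the real system was trained on historical admissions decisions.

STRONG_WORDS = {"best", "award", "research"}        # reported to raise scores
WEAK_WORDS = {"good", "programming", "technology"}  # reported to lower scores

def score_applicant(gpa: float, prestigious_school: bool, letter_text: str) -> float:
    """Return a score on a 0-5 scale from hand-picked (hypothetical) weights."""
    words = set(letter_text.lower().split())
    score = 1.0
    score += 2.5 * (gpa / 4.0)                   # higher GPA raises the score
    score += 0.8 if prestigious_school else 0.0  # alma mater prestige bonus
    score += 0.3 * len(words & STRONG_WORDS)
    score -= 0.2 * len(words & WEAK_WORDS)
    return max(0.0, min(5.0, score))

print(score_applicant(3.9, True, "the best student I have advised; award winning research"))
```

The bias concern is visible in the weights themselves: the prestige bonus and the keyword list simply reward whatever language and institutions the historical data rewarded.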

Critics of this approach countered that the system could undervalue women’s colleges and historically Black universities, and that it failed to take gender bias into consideration when evaluating letters of recommendation, since different language is often used to describe female students than male students.

“If I ask you to do a classifier of images and you’re looking for dogs, I can check afterwards that, yes, it did correctly identify dogs,” said Steve Rolston, chair of the University of Maryland at College Park’s physics department during a 2020 colloquium talk with the GRADE creators. “But when I’m asking for decisions about people, whether it’s graduate admissions, or hiring or prison sentencing, there’s no obvious correct answer. You train it, but you don’t know what the result is really telling you.”

The University of Texas phased out GRADE’s use in 2020 (before the controversy was raised later that year), citing that it was “too difficult to maintain” due to “changes in the data and software environment.”8

Contributors to algorithmic bias

During the COVID-19 pandemic, machine learning contributed to the development of at least four vaccines that made it to the clinical evaluation stage, and AI was used to analyze the records and medical imagery of COVID-19 patients.

But researchers from the United Nations and the Université de Montréal still questioned whether using AI would propagate or reduce inequality. They identified possible points of entry for bias during software development and ways to mitigate it.9

Bias, they said, can creep into software at these phases:

  • When the problem is framed and scoped. This can be viewed as “problem formulation” or how the problem to be solved is defined.10
  • In the data itself that’s fed into the AI system, especially since “Women and minorities are also often not properly represented in data sets whose use may result in medical treatments and services.”11
  • When the algorithms are designed, depending on their configuration and parameters.
  • When the results are evaluated, interpreted and put to use.

Framing and scoping the problem

Health researchers in 2019 found that a widely used commercial algorithm that predicts healthcare costs for millions of Americans exhibited significant racial bias: Black patients assigned the same risk score as white patients were actually substantially sicker, because health providers spent less on their care.

The researchers attributed the disparity partly to the way the problem was formulated: as a prediction of future health-related costs. The desire to predict that figure isn’t unreasonable, since patients with higher future costs probably would benefit the most from receiving additional resources, but costs are not the only possible target. Alternatives include avoidable future costs, such as emergency room visits, or a general future health score based on the patient’s current chronic health conditions.

The challenge of problem formulation lies in translating “an often amorphous concept we wish to predict into a concrete variable that can be predicted in a given dataset.”12

In this particular case, the researchers suggested merging health prediction and cost prediction, a more nuanced approach that considered not just the number of a patient’s conditions but also their severity, to prevent underestimating health risks. Because their study had access to inputs, outputs and eventual patient outcomes, they were able to ascertain that formulating the problem this way resulted in an 84% reduction in bias.13
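
The practical lever here is the choice of label. The toy sketch below, which is not the study’s actual pipeline and uses invented column names and synthetic data, contrasts training on a cost label with training on a health label built from chronic conditions; everything else stays the same, yet the two models define “high risk” differently.

```python
# Toy illustration of problem formulation: the same features and model, but two
# different labels. Column names, data and the model choice are invented; this
# is not the pipeline used in the study by Obermeyer et al.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
patients = pd.DataFrame({
    "age": rng.integers(30, 90, n),
    "prior_spending": rng.gamma(2.0, 3000.0, n),     # dollars spent last year
    "n_chronic_conditions": rng.poisson(2.0, n),
})
features = patients[["age", "prior_spending"]]

# Formulation A: predict next year's total cost, so spending patterns (and any
# disparity in spending) become the definition of "risk."
label_cost = patients["prior_spending"] * rng.normal(1.1, 0.2, n)

# Formulation B: predict a health score built from chronic conditions, closer
# to the question the program actually cares about: who is sick.
label_health = patients["n_chronic_conditions"] + rng.normal(0.0, 0.5, n)

model_cost = LinearRegression().fit(features, label_cost)
model_health = LinearRegression().fit(features, label_health)
# The two models can rank the same patients very differently; the label, not
# the learning algorithm, decides what "high risk" means.
```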

Assessing the data

One of the challenges of algorithmic bias is the very data used to train automated systems. One example played out in the search tags associated with professional soccer players: A search for male player Lionel Messi surfaced tags related primarily to his professional career (Argentina, Barcelona and his rival Cristiano Ronaldo), while a search for female player Megan Rapinoe surfaced tags related primarily to her appearance and family life.

“This is a lucid example of modern-day societal stereotype and how humans perceive them, which is then mistakenly encoded into the Tagging Algorithm,” wrote researchers from the Chaitanya Bharathi Institute of Technology in Hyderabad, India. “… Biases can be fed into the model through training data, due to either human assumptions while labelling or inadequate sampling.”14

Deficient or problematic data also crops up when using historic health care records, for instance, since they may be derived from segregated hospital facilities, discriminatory medical practices and curricula, and inequitable insurance access.15

The COVID-19 researchers mentioned previously point out that while diagnosis of the coronavirus from computed tomography (CT) and X-ray scans is possible, it assumes robust, balanced and representative datasets. Misdiagnosis is possible if the algorithms consider medical imagery alone, particularly in localities that also see a high occurrence of other lung-related conditions. And such data may not always be obtainable, because not all regions have the budget for the proper scanner equipment.16
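
One modest safeguard is to audit group representation in the training data before any model is fit. The sketch below assumes a hypothetical metadata table with columns for sex, region and diagnosis label; the 15% floor is an arbitrary threshold chosen for illustration.

```python
# Minimal sketch of a representation audit run before any model is fit.
# The table, column names and 15% floor are invented for illustration; in
# practice the rows would come from the dataset's own metadata.
import pandas as pd

meta = pd.DataFrame({
    "sex":    ["F", "M", "M", "M", "M", "F", "M", "M", "M", "M"],
    "region": ["north", "north", "north", "south", "north",
               "north", "north", "north", "north", "north"],
    "label":  ["covid", "covid", "other_lung", "covid", "healthy",
               "covid", "covid", "covid", "covid", "covid"],
})

for col in ["sex", "region", "label"]:
    shares = meta[col].value_counts(normalize=True)
    print(f"\nShare of training examples by {col}:")
    print(shares.round(2))
    low = shares[shares < 0.15]          # arbitrary floor for flagging groups
    if not low.empty:
        print("WARNING: underrepresented:", list(low.index))
```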

Mitigation methods

Researchers have suggested a number of ways to reduce algorithmic bias during software design — though, like any complicated problem, the solutions themselves are nontrivial, particularly when addressing the issue of fairness.

These recommendations include:

  • Design algorithms from the ground up as a way of being explicitly aware of values.17
  • Run simulations to test algorithm results.18
  • Make random choices verifiable.19
  • Regulate proxy-driven systems.20
  • Increase transparency. Twitter, for example, has taken it upon itself to release its internal research findings to the public in the hopes that doing so will help it improve its services. “Responsible AI is hard in part because no one fully understands decisions made by algorithms,” wrote tech journalist Casey Newton of Twitter’s disclosures.21 On the other hand, transparency, specifically in the context of source code, has also been argued against for privacy and security reasons.22

Below, we’ll delve deeper into a few of these methods.

Reproducible randomization

In cases in which an algorithm requires randomization, such as a lottery, there needs to be a way to make the process “fully reproducible and reviewable” to confirm its fairness. Otherwise, there’s no way to know whether the automated outputs are truly random, or whether the outcome has been improperly influenced.

The challenge here is that it’s difficult to recreate a process that’s been randomized: For example, if its behavior depends on input from its environment, that can make its outputs nondeterministic.

To address this proactively, the algorithm should be designed from the start with this oversight in mind; it needs to be able to demonstrate that its random choices do not bias its output. To do so, when testing, the software can replace the random choice with a known seed value from which the random values are generated in a controlled, pseudorandom manner. The algorithm can then be rerun as long as the seed is known, allowing testers to review and verify its randomness without having to generate a wholly random choice every time the software needs that value.

“… [T]his technique reduces the relevant portion of the environment to a very small and manageable value (the seed) while preserving the benefits of using randomness in the system,” writes Joshua Kroll, a computer scientist at the Naval Postgraduate School in California.23
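
In code, the core of the technique is simply exposing and reusing the seed. The sketch below is a minimal illustration of a seeded, reviewable lottery, not the fuller accountable-algorithms protocol described by Kroll et al.; the seed value and applicant names are placeholders.

```python
# Minimal sketch of a reproducible lottery: the same published seed always
# yields the same draw, so a reviewer can rerun the selection and verify it.
# The seed value and applicant names are placeholders.
import random

def run_lottery(applicants: list[str], winners: int, seed: int) -> list[str]:
    rng = random.Random(seed)   # pseudorandom generator fully determined by the seed
    pool = sorted(applicants)   # fix the ordering so only the seed matters
    return rng.sample(pool, winners)

seed = 20211210                 # in practice, generated and disclosed for review
print(run_lottery(["ana", "bo", "chen", "dee", "eli"], winners=2, seed=seed))
# Rerunning with the same seed reproduces exactly the same winners.
```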

The use of proxy variables

It is standard practice for models to use what are known as proxy variables: when the actual value of interest is unobservable, too expensive to measure or, in the case of attributes such as race or gender, legally barred from consideration in contexts such as employment, it is substituted with a correlated, more easily measurable value.

But the mainstreaming of proxy-driven analysis has worsened inequity when proxies are poorly chosen, only loosely connected to the value of interest, or applied without regard to the bigger picture. One hypothetical example provided by a computer scientist and legal scholar pair, Sloan and Warner, is a woman who declares bankruptcy after defaulting on credit card debt racked up paying for her daughter’s medical treatment. After the bankruptcy, with her daughter recovered, the woman would be a good credit risk, but the bankruptcy remains a black mark on her record and drags down her credit score. Consequently, her insurance company, which uses the credit rating as a proxy for safe driving, raises her auto insurance rates.

To help mitigate the algorithmic bias that can result from the use of proxies, the researchers propose two regulating conditions. The first is to identify “before start” attributes that unfairly favor one group over another and exclude them from the system. For instance, a system should not penalize individuals who took part in a remedial reading program designed to ensure all children reach the same basic reading ability.

The second condition is to build AI systems that recognize when certain uses of “after start” attributes could sway the outcome unfairly, and then not use the attribute in that way. They return to the bankruptcy example, comparing the original system, which automatically raises the woman’s insurance premiums, against another that considers not just the woman’s bankruptcy but other cases like hers to better gauge the circumstances, thereby avoiding the same outcome.24
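
Translated into software terms, the two conditions amount to filtering which attributes reach the model and constraining how the remaining ones may be used. The sketch below is a hypothetical rendering of that idea, not Sloan and Warner’s own formalism; the attribute names and rules are invented.

```python
# Hypothetical rendering of Sloan and Warner's two conditions as input filtering.
# Attribute names and rules are invented for illustration.

# Condition 1: "before start" attributes that would unfairly disadvantage a
# group are dropped before the model ever sees them.
BEFORE_START_EXCLUDED = {"attended_remedial_reading"}

def filter_before_start(record: dict) -> dict:
    return {k: v for k, v in record.items() if k not in BEFORE_START_EXCLUDED}

# Condition 2: "after start" attributes may be used, but not in ways that
# ignore context. Here a bankruptcy only counts against the applicant when it
# was not driven by medical debt (a crude stand-in for "consider cases like hers").
def after_start_penalty(record: dict) -> float:
    if record.get("bankruptcy") and not record.get("medical_debt_bankruptcy"):
        return 1.0
    return 0.0

applicant = {"bankruptcy": True, "medical_debt_bankruptcy": True,
             "attended_remedial_reading": True, "years_driving": 20}
print(filter_before_start(applicant))
print(after_start_penalty(applicant))   # 0.0: the medical-debt bankruptcy is not penalized
```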

Transparency

Whether algorithms created by artificial intelligence are useful depends on how they’re developed and “how transparently their outputs are evaluated.”25

One example used to make this point is Tay, an AI-driven Twitter chatbot Microsoft released in 2016. Tay was designed to interact with and learn from the Twitter community as an experiment in “conversational understanding.” Though its initial tweets (“new phone who dis?”) seemed innocuous, within 24 hours its messages had devolved, echoing the inflammatory content being tweeted at it with equally offensive remarks of its own.26

Although neither Microsoft nor Twitter ever intended for Tay to turn into a “raving bigot,” it could be argued that Tay performed successfully as a chatbot within the context of its programming.27 Within the realm of human evaluation, however, Tay’s performance was clearly a failure, and that failure was easy to spot largely because Tay’s outputs, its tweets, were directly observable to the public. Had that not been the case, had Tay instead been relegated to a background task such as scoring tweets for how provocative they were, inadvertently scoring those with abusive content higher, the problem would have been much more difficult to detect.28

Twitter itself experienced a very similar challenge with nontransparent outputs earlier this year. The social media service, which uses machine learning in its tweet recommendation algorithms, shared in October an analysis of how its service amplifies political content. The company found, among other things, that its algorithms promoted tweets from right-wing politicians and news sources more often than those with left-wing leanings.29

Why this is happening is as yet unclear; Twitter says determining the cause of the inequity is the company’s next step.

“Establishing why these observed patterns occur is a significantly more difficult question to answer as it is a product of the interactions between people and the platform,” the company wrote in a blog post. “… Algorithmic amplification is problematic if there is preferential treatment as a function of how the algorithm is constructed versus the interactions people have with it.”30
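
The shape of such an analysis can be sketched in a few lines: compare how often a group’s tweets are shown by the ranking algorithm against a reverse-chronological baseline. The numbers and group labels below are invented, and the real methodology is documented in Twitter’s report.

```python
# Illustrative amplification ratio: how often a group's tweets are shown by the
# ranked timeline relative to a reverse-chronological baseline. The impression
# counts and group labels are invented; see Twitter's report for its actual
# methodology.
impressions_ranked = {"party_a": 1_300_000, "party_b": 1_650_000}
impressions_chrono = {"party_a": 1_000_000, "party_b": 1_000_000}

for group, ranked in impressions_ranked.items():
    ratio = ranked / impressions_chrono[group]
    print(f"{group}: amplification ratio = {ratio:.2f}")
# A ratio above 1.0 means the ranking algorithm surfaces that group's tweets
# more often than a purely chronological feed would.
```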

Conclusion

The increasing reliance on algorithmic automation means a greater likelihood of algorithmic bias, which has been shown to have real-world consequences that crop up in all areas of life. This makes it not just a software design problem but a human problem, one that researchers, software professionals, governments and companies are seeking to address using a variety of techniques.

At the same time, the desire to proactively account for algorithmic bias means acknowledging that it is a reflection not only of technology but of society itself.

“…[B]iases are deeply embedded in our societies and it is an illusion to believe that the AI and bias problem will be eliminated only with technical solutions,” wrote Ntoutsi et al. in a paper about bias in data-driven AI systems. “Nevertheless, as the technology reflects and projects our biases into the future, it is a key responsibility of technology creators to understand its limits and to propose safeguards to avoid pitfalls. Of equal importance is also for the technology creators to realize that technical solutions without any social and legal ground cannot thrive and therefore multidisciplinary approaches are required.”31


  1. Miller, C. C. (2015, July 9). When Algorithms Discriminate. The New York Times. https://www.nytimes.com/2015/07/10/upshot/when-algorithms-discriminate.html

  2. Benjamin, R. (2019, October 25). Assessing risk, automating racism: A health care algorithm reflects underlying racial bias in society. Science, 366(6464), pp. 421-422. https://doi.org/10.1126/science.aaz3873

  3. Florida State University Libraries. (2021, September 23). Algorithm Bias. https://guides.lib.fsu.edu/algorithm

  4. Ntoutsi, E., Fafalios, P., Gadiraju, U., Iosifidis, V., Nejdl, W., Vidal, M., … Broelemann, K. (2020). Bias in data‐driven artificial intelligence systems: An introductory survey. WIREs: Data Mining & Knowledge Discovery, 10(3), 1-14. https://doi.org/10.1002/widm.1356

  5. Waters, A., & Miikkulainen, R. (2014, March 22). GRADE: Machine Learning Support for Graduate Admissions. AI Magazine, 35(1), 64-75. https://doi.org/10.1609/aimag.v35i1.2504

  6. Burke, L. (2020, December 14). The Death and Life of an Admissions Algorithm. Inside Higher Ed. https://www.insidehighered.com/admissions/article/2020/12/14/u-texas-will-stop-using-controversial-algorithm-evaluate-phd

  7. Waters & Miikkulainen, 2014

  8. Burke, 2020

  9. Luengo-Oroz, M., Lam, C., Bullock, B., Luccioni, A., & Pham, K. (2021). From Artificial Intelligence Bias to Inequality in the Time of COVID-19. IEEE Technology and Society Magazine, 40(1), 71-79. https://doi.org/10.1109/MTS.2021.3056282

  10. Benjamin, 2019

  11. Luengo-Oroz et al., 2021

  12. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019, October 25). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), pp. 447-453. https://doi.org/10.1126/science.aax2342

  13. Ibid.

  14. Ahmed, S., Athyaab, S., & Muqtadeer, S. (2021). Attenuation of Human Bias in Artificial Intelligence: An Exploratory Approach. 6th International Conference on Inventive Computation Technologies (ICICT), (pp. 557–563). https://doi.org/10.1109/ICICT50816.2021.9358507

  15. Benjamin, 2019

  16. Luengo-Oroz et al., 2021

  17. Miller, 2015

  18. Miller, 2015

  19. Kroll, J., Huey, J., Barocas, S., Felten, E., Reidenberg, J., Robinson, D., & Yu, H. (2017). Accountable Algorithms. University of Pennsylvania Law Review, 165(3), 633-706. https://scholarship.law.upenn.edu/penn_law_review/vol165/iss3/3

  20. Sloan, R., & Warner, R. (2020). Beyond Bias: Artificial Intelligence and Social Justice. Virginia Journal of Law & Technology, 24(1), 1-32. https://dx.doi.org/10.2139/ssrn.3530090

  21. Newton, C. (2021, November 19). How Twitter got research right: While other tech giants hide from their researchers. The Verge. https://www.theverge.com/2021/11/19/22790174/twitter-research-transparency-published-findings

  22. Kroll et al., 2017

  23. Kroll et al., 2017

  24. Sloan & Warner, 2020

  25. Omar, R. (2021). Unabashed Bias: How Health-Care Organizations Can Significantly Reduce Bias in the Face of Unaccountable AI. Denver Law Review, 98(4), 807-837.

  26. Vincent, J. (2016, March 24). Twitter taught Microsoft’s AI chatbot to be a racist asshole in less than a day. The Verge. https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist

  27. Omar, 2021

  28. Omar, 2021

  29. Chowdhury, R., & Belli, L. (2021, October 21). Examining algorithmic amplification of political content on Twitter. https://blog.twitter.com/en_us/topics/company/2021/rml-politicalcontent

  30. Ibid.

  31. Ntoutsi et al., 2020