How important is explainability? Applying clinical trial principles to AI safety testing

The use of AI in consumer-facing businesses is on the rise, as is concern over how best to govern the technology over the long term. Pressure to better govern AI is only growing with the Biden administration’s recent executive order, which mandated new measurement protocols for the development and use of advanced AI systems.

AI providers and regulators today are highly focused on explainability as a pillar of AI governance, enabling those affected by AI systems to best understand and challenge those systems’ outcomes, including bias.

While explaining AI is practical for simpler algorithms, like those used to approve car loans, more recent AI technology uses complex algorithms that can be extremely complicated to explain but still provide powerful benefits.

OpenAI’s GPT-4 is trained on massive amounts of data, with billions of parameters, and can produce human-like conversations that are revolutionizing entire industries. Similarly, Google DeepMind’s cancer screening models use deep learning methods to detect disease accurately, which can save lives.

These complex models can make it nearly impossible to trace how a decision was made, and it may not even be meaningful to do so. The question we must ask ourselves is: Should we deprive the world of these technologies that are only partially explainable, when we can ensure they bring benefit while limiting harm?

Even US lawmakers who seek to regulate AI are quickly understanding the challenges around explainability, revealing the need for a different approach to AI governance for this complex technology — one more focused on outcomes, rather than solely on explainability.

Dealing with uncertainty around novel technology isn’t new

The medical science community has long recognized that to avoid harm when developing new therapies, one must first identify what the potential harm might be. To assess the risk of this harm and reduce uncertainty, the randomized controlled trial was developed.

In a randomized controlled trial, a common form of clinical trial, participants are randomly assigned to treatment and control groups. The treatment group receives the medical intervention and the control group does not, and the outcomes in both cohorts are observed.

By comparing the two demographically comparable cohorts, causality can be identified — meaning the observed impact is a result of a specific treatment.

Historically, medical researchers have relied on a stable testing design to determine a therapy’s long-term safety and efficacy. But in the world of AI, where the system is continuously learning, new benefits and risks can emerge every time the algorithms are retrained and deployed.

The classical randomized controlled trial may not be fit for purpose to assess AI risks. But there could be utility in a similar framework, like A/B testing, that can measure an AI system’s outcomes in perpetuity.

How A/B testing can help determine AI safety

Over the last 15 years, A/B testing has been used extensively in product development, where groups of users are treated differentially to measure the impacts of certain product or experiential features. This can include identifying which buttons are more clickable on a web page or mobile app, and when to time a marketing email.

The former head of experimentation at Bing, Ronny Kohavi, introduced the concept of online continuous experimentation. In this testing framework, Bing users were randomly and continuously allocated to either the current version of the site (the control) or the new version (the treatment).

These groups were constantly monitored, then assessed on several metrics based on overall impact. Randomizing users ensures that the observed differences in the outcomes between treatment and control groups are due to the interventional treatment and not something else — such as time of day, differences in the demographics of the user, or some other treatment on the website.
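
As a minimal sketch of the allocation and comparison described above (not a description of Bing's actual system), the Python below assigns each user to a group by hashing a user ID together with an experiment name, then compares a metric across the two groups. The function names, the hash-based scheme and the 50/50 split are all assumptions made for illustration.

```python
import hashlib
from statistics import mean


def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    Hypothetical helper: hashing the experiment name with the user ID gives a
    stable, pseudo-random bucket, so a user always sees the same version and
    different experiments are allocated independently of one another.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) / 2**32  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def difference_in_means(treatment: list[float], control: list[float]) -> float:
    """Estimated effect of the new version on a metric (treatment minus control)."""
    return mean(treatment) - mean(control)
```

Because assignment depends only on the hash and not on time of day, demographics or other features of the site, those factors are balanced across the groups on average, which is what allows the difference in means to be read as the effect of the change. A real analysis would also attach a confidence interval or significance test to that difference.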

This framework allowed technology companies like Bing — and later Uber, Airbnb and many others — to make iterative changes to their products and user experience and understand the benefit of these changes on key business metrics. Importantly, they built infrastructure to do this at scale, with these businesses now managing potentially thousands of experiments concurrently.

The result is that many companies now have a system to iteratively test changes to a technology against a control or a benchmark: One that can be adapted to measure not just business benefits like clickthrough, sales and revenue, but also causally identify harms like disparate impact and discrimination.

What effective measurement of AI safety looks like

A large bank, for instance, might be concerned that its new pricing algorithm for personal lending products is unfair in its treatment of women. While the model does not use protected attributes like gender explicitly, the business is concerned that proxies for gender may have been used when training the model, and so it sets up an experiment.

Customers in the treatment group are priced with the new algorithm. For customers in the control group, lending decisions are made using a benchmark model that has been in use for the last 20 years.

Assuming demographic attributes like gender are known, are balanced between the treatment and control groups and are present in sufficient volume, the disparate impact between men and women (if there is one) can be measured, which answers whether the AI system is fair in its treatment of women.
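
One way to make this measurement concrete is sketched below in Python, under stated assumptions: each record is a hypothetical `(arm, gender, price)` tuple, the gap is defined as the average price offered to women minus the average offered to men, and the quantity of interest is how much larger that gap is under the new algorithm than under the benchmark. The function names and record layout are illustrative, not taken from any bank's system.

```python
from collections import defaultdict
from statistics import mean


def gender_gap_by_arm(records):
    """Average price gap (women minus men) within each experiment arm.

    `records` is an iterable of (arm, gender, price) tuples, where arm is
    'treatment' or 'control' and gender is 'women' or 'men' (assumed labels).
    """
    prices = defaultdict(list)
    for arm, gender, price in records:
        prices[(arm, gender)].append(price)
    return {
        arm: mean(prices[(arm, "women")]) - mean(prices[(arm, "men")])
        for arm in ("treatment", "control")
    }


def excess_gap(records):
    """How much wider the gender gap is under the new model than the benchmark."""
    gaps = gender_gap_by_arm(records)
    return gaps["treatment"] - gaps["control"]
```

An excess gap near zero suggests the new algorithm treats women no differently than the benchmark does, while a clearly positive value points to disparate impact introduced by the new model. With real data, the estimate would need a confidence interval before drawing that conclusion.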

Exposure of human subjects to AI can also be introduced more slowly through a controlled rollout of new product features, where the feature is gradually released to a larger proportion of the user base.
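
A gradual release of this kind can be sketched as a percentage gate. The Python below is an illustrative assumption rather than any particular company's rollout tooling: each user gets a fixed bucket from 0 to 99, and the feature is shown to users whose bucket falls below the current rollout percentage.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Stable bucket in 0..99 for a user and feature (hypothetical helper)."""
    key = f"{feature}:{user_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest()[:8], 16) % 100


def is_exposed(user_id: str, feature: str, percent: int) -> bool:
    """True if the user is inside the current rollout percentage."""
    return rollout_bucket(user_id, feature) < percent
```

Because a user's bucket never changes, raising the percentage from 5 to 20 only adds users. Everyone exposed at the earlier stage stays exposed, so harms observed early can be tracked in the same population as the rollout widens.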

Alternatively, the treatment can be limited to a smaller, less risky population first. For instance, Microsoft uses red teaming, where a group of employees interact with the AI system in an adversarial way to test its most significant harms before releasing it to the general population.

Measuring AI safety ensures accountability

Where explainability can be subjective and poorly understood in many cases, evaluating an AI system in terms of its outputs on different populations provides a quantitative and tested framework for determining whether an AI algorithm is actually harmful.

Critically, it establishes accountability of the AI system, where an AI provider can be responsible for the system’s proper functioning and alignment with ethical principles. In increasingly complex environments where users are being treated by many AI systems, continuous measurement using a control group can determine which AI treatment caused the harm and hold that treatment accountable.

While explainability remains a heightened focus for AI providers and regulators across industries, the techniques first used in healthcare and later adopted in tech to deal with uncertainty can help achieve what is a universal goal — that AI is working as intended and, most importantly, is safe.

Caroline O’Brien is chief data officer and head of product at Afiniti, a customer experience AI company.

Elazer R. Edelman is the Edward J. Poitras professor in medical engineering and science at MIT, professor of medicine at Harvard Medical School and senior attending physician in the coronary care unit at the Brigham and Women’s Hospital in Boston.
