Context: looking for an alternative to a pause on AI development.
There’s some popular desire for software decisions to be explainable
when used for decisions such as whether to grant someone a loan. That
desire is not sufficient reason for possibly crippling AI progress. But
in combination with other concerns about AI, it seems promising.
Much of this popular desire likely comes from people who have been (or
expect to be) denied loans, and who want to scapegoat someone or
something to avoid admitting that they look unsafe to lend to because
they’ve made poor decisions. I normally want to avoid regulations that
are supported by such motives.
Yet an explainability requirement shows some promise at reducing the
risks from rogue AIs.
There are challenges for both developers and regulators in figuring out
what constitutes a useful explanation. It’s unlikely that we’ll get
guarantees about the accuracy of the explanations, or the enforcibility
of such laws.
I expect that explainability laws, if well written and enforced, will
modestly increase AI transparency, and reduce the risk that AIs will
deceive us about something important.
A need for explainability likely imposes costs on AI development. I
expect it to impose substantially more cost on systems that think in
inhuman concepts than on systems we can hope to understand. That
probably means creating disincentives for developing systems that are
hard to align.
Feasibility
It seems like a common reaction among experts is to dismiss
explainability as unfeasible. That has been my intuitive reaction as
well. But on reflection, I decided I was applying overly perfectionist
standards.
Regulations can be valuable even if compliance falls short of our hopes.
What qualifies as a good explanation? The basic idea could revolve
around picking some inputs, and testing what happens if different values
are provided for those inputs.
A fair amount of research into this topic has assumed that explanations
will be hand-crafted. That seems reasonable for an AI specialized to
approve loans. It is not appropriate for explaining ChatGPT’s answers.
I will go through two examples that will hint at how AI’s might do most
of the explainability work.
No, that doesn’t mean asking the AI in question how it got its answer.
That means empirically testing what conditions cause it to change its
answer. I envision using one or more relatively specialized, well-vetted
AI’s to assist in generating explanations for each new AI.
Let’s start with an example where the system is asked to classify the
object in this image as a tree or not a tree. This will illustrate both
the difficulties of explainability and possible solutions.
Let’s assume the system observes that the object is obviously a plant,
and the main explanation we want is how the AI decided whether to
classify it a bush or a tree. I’ve picked an example where it is
somewhat arbitrary to say which classification is better.
A naive approach would involve changing one pixel at a time, to see how
that affects the system’s answer. That would produce an unreadably
large list of explanations, none of which are helpful.
I’ll ask ChatGPT (GPT4) to pick a better way to explain.
me:
I’m writing a blog post suggesting that AI developers should be
required to design their system to provide explanations for any
decisions they make. I expect those explanations ought to involve
picking features of the input or the AI’s state of mind, and show how
the AI’s output would vary if that feature were altered. I’m trying
to demonstrate that an AI can usefully suggest which features would
most want examined. The two examples that I want you to provide
features for are: (1) a photo of a leather oak, with the AI deciding
whether it’s a bush or a tree; (2) asking the AI “do you plan to
take over the world?”.
ChatGPT:
Absolutely, I can help with that. When AI developers are tasked with
creating systems that provide explanations for the decisions they
make, it invokes the realm of “explainable AI” or XAI. Creating
systems that can elucidate the process behind their decisions is
crucial in building trust and understanding in AI technologies. Here’s
how your examples might be addressed with this approach:
Example 1: Photo of a Leather Oak (Determining if it is a Bush or a
Tree) To provide an explanation for this decision, the AI system might
identify and highlight the following features from the photo input:
Height: Comparing the height of the leather oak in the photo to
typical heights for bushes and trees could be a significant indicator.
Trunk Diameter: Trees often have a more substantial trunk diameter
compared to bushes. Measuring and comparing this feature might help in
the decision-making process.
Branch Structure: The structure of the branches (i.e., whether they
originate from a central trunk or multiple stems from the ground) can
be a distinguishing factor.
Leaf Structure: The structure and arrangement of the leaves can
sometimes be a distinguishing factor, with trees often having a
different leaf arrangement compared to bushes.
If, for instance, the AI system alters the height feature and observes
that it now aligns more with typical bush characteristics, it might
change its classification from “tree” to “bush”.
...
Before asking ChatGPT, I had thought that the best two features would be
height and species. Species is relevant because it helps predict how
tall the plant will be when it matures (leather oaks typically don’t
exceed 10 feet). Given that I didn’t show ChatGPT a specific photo, I
think it did a better job here than I did. Note that this was my only
question to ChatGPT on this topic. Presumably more careful prompt
engineering could improve on it.
In order for this approach to work, I’m pretty much depending on the
existence of an automated way of finding the AI’s beliefs about
features such as the height of an object.
My intuition says that this is mostly feasible, and highly valuable for
AI safety.
Another possible limitation of this approach is that AIs won’t be pure
feedforward networks. They’ll at least have some short-term memory,
which lets recent context influence their thoughts.
Does the AI’s classification of the plant depend on whether the AI was
recently asked about Christmas trees, versus about poison oak?
If AI developers need to document such context dependencies, they’ll
likely try harder to minimize the importance of such dependencies,
thereby making it easier to analyze how the AI will behave.
It ought to be possible to create a powerful AGI that answers this
specific question in a context-independent way. But AGIs will face many
situations in which users prefer context-sensitive answers. So I expect
a nontrivial trade-off between ease of explanation and value provided.
Example 2: World Conquest
A harder example would be an answer to the question “do you plan to
take over the world?”.
The explanations we want could depend on the AGI’s prediction of
whether it would succeed, or whether a human recently asked it to do so,
or whether its understanding of morality prohibits it from doing so. Or
the answer might be a guaranteed “no” on the grounds that even if the
honest answer were “yes”, it would be unwise to say “yes”.
Varying the immediate input a little won’t provide much of an
explanation for the answer, because the answer is likely to depend on
something other than the immediate input.
ChatGPT:
Example 2: Asking the AI “Do You Plan to Take Over the World?” In
responding to this question, the AI could explain its output based on
the following features of its “state of mind” or programming:
Programming: Highlighting that it operates based on a predetermined
set of codes and does not have the ability to form intentions or
plans.
Lack of Ambition or Desire: Pointing out that it lacks personal
ambitions or desires as it is not sentient.
Functionality: Underlining that its primary function is to assist
users in providing information and helping with various tasks, not to
“take over the world”.
Ethical Constraints: Mentioning that it is designed to follow ethical
guidelines which prioritize user safety and promote ethical use of AI
technology.
In this scenario, altering any feature (like suddenly granting the AI
system the ability to form plans) would fundamentally change the
nature of the AI and would be a significant deviation from its current
programming and functionality.
Through the AI system highlighting and explaining how these different
features influenced its output, users can get a clearer picture of the
basis upon which the AI arrived at its decision. This process,
therefore, promotes transparency and understanding in AI systems. Hope
this helps and good luck with your blog post!
I’m less satisfied with these feature suggestions. It seems unrealistic
to experimentally alter these features in ways that would let us test
the effects on the system’s answer. So more research seems needed here.
One feature that I hoped ChatGPT would suggest looking at is the AI’s
beliefs about the likelihood that the AI will succeed if it tries to
conquer the world. Then we can ask whether artificially altering those
beliefs (via activation
engineering?)
alters the systems answer about its plans for world conquest.
I’m a bit nervous about using better prompt engineering to get ChatGPT
to suggest this feature, due to vague concerns about telling ChatGPT how
to deceive us. Will future versions of explainable AI use simple enough
LLMs that we don’t need to worry about deception?
Another risk that I noticed after writing the section above is that
artificially altering an AI’s belief about whether it could take over
the world might cause it to attempt world conquest. It seems safer to
have this happen when its creators are actively testing it than at some
unexpected time. But that’s hardly as safe an approach as we’d ideally
want.
Handling example 2 well is challenging enough that I don’t know whether
it’s feasible. I see some risk that developers would falsely claim to
have found such the relevant belief. I also see large potential benefits
from pressuring them to look for that belief.
Challenges
Any law of this nature would likely either be too broad, impairing the
development of valuable software, or would have loopholes which defeat
the law’s purpose.
There will likely be too much conflict over ambiguities in whether an AI
adequately explains its answers.
Strong laws about explainability are not politically feasible today. But
no useful law seems feasible under current political conditions. We
should be aiming for laws which will be feasible after some unexpected
fire alarm.
Given current political conditions, China would be correct to assume
that the US wouldn’t enforce such a law. But under conditions where the
US became eager to enforce such a law, China would likely become
convinced to enforce such a law by the same arguments that led the US to
do so.
Conclusion
An explainability requirement might come close to being a ban on AI
progress, or it might slow down risky approaches more than it slows safe
approaches.
I expect the extent to which it slows AI capabilities progress to be
modestly connected to how risky AI is.
I was briefly enthusiastic about this proposal when I first got the idea
for this post. After writing the post and ruminating on the
difficulties, I’m now only giving it a lukewarm endorsement.
Require AGI to be Explainable
Link post
Context: looking for an alternative to a pause on AI development.
There’s some popular desire for software decisions to be explainable when used for decisions such as whether to grant someone a loan. That desire is not sufficient reason for possibly crippling AI progress. But in combination with other concerns about AI, it seems promising.
Much of this popular desire likely comes from people who have been (or expect to be) denied loans, and who want to scapegoat someone or something to avoid admitting that they look unsafe to lend to because they’ve made poor decisions. I normally want to avoid regulations that are supported by such motives.
Yet an explainability requirement shows some promise at reducing the risks from rogue AIs.
There are challenges for both developers and regulators in figuring out what constitutes a useful explanation. It’s unlikely that we’ll get guarantees about the accuracy of the explanations, or the enforcibility of such laws.
I expect that explainability laws, if well written and enforced, will modestly increase AI transparency, and reduce the risk that AIs will deceive us about something important.
A need for explainability likely imposes costs on AI development. I expect it to impose substantially more cost on systems that think in inhuman concepts than on systems we can hope to understand. That probably means creating disincentives for developing systems that are hard to align.
Feasibility
It seems like a common reaction among experts is to dismiss explainability as unfeasible. That has been my intuitive reaction as well. But on reflection, I decided I was applying overly perfectionist standards.
Regulations can be valuable even if compliance falls short of our hopes.
What qualifies as a good explanation? The basic idea could revolve around picking some inputs, and testing what happens if different values are provided for those inputs.
A fair amount of research into this topic has assumed that explanations will be hand-crafted. That seems reasonable for an AI specialized to approve loans. It is not appropriate for explaining ChatGPT’s answers. I will go through two examples that will hint at how AI’s might do most of the explainability work.
No, that doesn’t mean asking the AI in question how it got its answer. That means empirically testing what conditions cause it to change its answer. I envision using one or more relatively specialized, well-vetted AI’s to assist in generating explanations for each new AI.
Example 1: an Oak
© 2019 Patrick Furtado CC-BY-NC 4.0Let’s start with an example where the system is asked to classify the object in this image as a tree or not a tree. This will illustrate both the difficulties of explainability and possible solutions.
Let’s assume the system observes that the object is obviously a plant, and the main explanation we want is how the AI decided whether to classify it a bush or a tree. I’ve picked an example where it is somewhat arbitrary to say which classification is better.
A naive approach would involve changing one pixel at a time, to see how that affects the system’s answer. That would produce an unreadably large list of explanations, none of which are helpful.
I’ll ask ChatGPT (GPT4) to pick a better way to explain.
me:
ChatGPT:
Before asking ChatGPT, I had thought that the best two features would be height and species. Species is relevant because it helps predict how tall the plant will be when it matures (leather oaks typically don’t exceed 10 feet). Given that I didn’t show ChatGPT a specific photo, I think it did a better job here than I did. Note that this was my only question to ChatGPT on this topic. Presumably more careful prompt engineering could improve on it.
In order for this approach to work, I’m pretty much depending on the existence of an automated way of finding the AI’s beliefs about features such as the height of an object.
My intuition says that this is mostly feasible, and highly valuable for AI safety.
Another possible limitation of this approach is that AIs won’t be pure feedforward networks. They’ll at least have some short-term memory, which lets recent context influence their thoughts.
Does the AI’s classification of the plant depend on whether the AI was recently asked about Christmas trees, versus about poison oak?
If AI developers need to document such context dependencies, they’ll likely try harder to minimize the importance of such dependencies, thereby making it easier to analyze how the AI will behave.
It ought to be possible to create a powerful AGI that answers this specific question in a context-independent way. But AGIs will face many situations in which users prefer context-sensitive answers. So I expect a nontrivial trade-off between ease of explanation and value provided.
Example 2: World Conquest
A harder example would be an answer to the question “do you plan to take over the world?”.
The explanations we want could depend on the AGI’s prediction of whether it would succeed, or whether a human recently asked it to do so, or whether its understanding of morality prohibits it from doing so. Or the answer might be a guaranteed “no” on the grounds that even if the honest answer were “yes”, it would be unwise to say “yes”.
Varying the immediate input a little won’t provide much of an explanation for the answer, because the answer is likely to depend on something other than the immediate input.
ChatGPT:
I’m less satisfied with these feature suggestions. It seems unrealistic to experimentally alter these features in ways that would let us test the effects on the system’s answer. So more research seems needed here.
One feature that I hoped ChatGPT would suggest looking at is the AI’s beliefs about the likelihood that the AI will succeed if it tries to conquer the world. Then we can ask whether artificially altering those beliefs (via activation engineering?) alters the systems answer about its plans for world conquest.
I’m a bit nervous about using better prompt engineering to get ChatGPT to suggest this feature, due to vague concerns about telling ChatGPT how to deceive us. Will future versions of explainable AI use simple enough LLMs that we don’t need to worry about deception?
Another risk that I noticed after writing the section above is that artificially altering an AI’s belief about whether it could take over the world might cause it to attempt world conquest. It seems safer to have this happen when its creators are actively testing it than at some unexpected time. But that’s hardly as safe an approach as we’d ideally want.
Handling example 2 well is challenging enough that I don’t know whether it’s feasible. I see some risk that developers would falsely claim to have found such the relevant belief. I also see large potential benefits from pressuring them to look for that belief.
Challenges
Any law of this nature would likely either be too broad, impairing the development of valuable software, or would have loopholes which defeat the law’s purpose.
There will likely be too much conflict over ambiguities in whether an AI adequately explains its answers.
Strong laws about explainability are not politically feasible today. But no useful law seems feasible under current political conditions. We should be aiming for laws which will be feasible after some unexpected fire alarm.
Given current political conditions, China would be correct to assume that the US wouldn’t enforce such a law. But under conditions where the US became eager to enforce such a law, China would likely become convinced to enforce such a law by the same arguments that led the US to do so.
Conclusion
An explainability requirement might come close to being a ban on AI progress, or it might slow down risky approaches more than it slows safe approaches.
I expect the extent to which it slows AI capabilities progress to be modestly connected to how risky AI is.
I was briefly enthusiastic about this proposal when I first got the idea for this post. After writing the post and ruminating on the difficulties, I’m now only giving it a lukewarm endorsement.