Summary: “Imagining and building wise machines: The centrality of AI metacognition” by Johnson, Karimi, Bengio, et al.

Authors:

Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann

Abstract:

Recent advances in artificial intelligence (AI) have produced systems capable of increasingly sophisticated performance on cognitive tasks. However, AI systems still struggle in critical ways: unpredictable and novel environments (robustness), lack of transparency in their reasoning (explainability), challenges in communication and commitment (cooperation), and risks due to potential harmful actions (safety). We argue that these shortcomings stem from one overarching failure: AI systems lack wisdom. Drawing from cognitive and social sciences, we define wisdom as the ability to navigate intractable problems—those that are ambiguous, radically uncertain, novel, chaotic, or computationally explosive—through effective task-level and metacognitive strategies. While AI research has focused on task-level strategies, metacognition—the ability to reflect on and regulate one’s thought processes—is underdeveloped in AI systems. In humans, metacognitive strategies such as recognizing the limits of one’s knowledge, considering diverse perspectives, and adapting to context are essential for wise decision-making. We propose that integrating metacognitive capabilities into AI systems is crucial for enhancing their robustness, explainability, cooperation, and safety. By focusing on developing wise AI, we suggest an alternative to aligning AI with specific human values—a task fraught with conceptual and practical difficulties. Instead, wise AI systems can thoughtfully navigate complex situations, account for diverse human values, and avoid harmful actions. We discuss potential approaches to building wise AI, including benchmarking metacognitive abilities and training AI systems to employ wise reasoning. Prioritizing metacognition in AI research will lead to systems that act not only intelligently but also wisely in complex, real-world situations.

Comments and Summary:

Why I’m Sharing This Article

I’m mainly sharing this because some of its ideas resemble those in my post Some Preliminary Notes on the Promise of a Wisdom Explosion. In particular, the authors talk about a “virtuous cycle” in relation to wisdom in the final paragraphs:

Second, by simultaneously promoting robust, explainable, cooperative, and safe AI, these qualities are likely to amplify one another. Robustness will facilitate cooperation (by improving confidence from counterparties in its long-term commitments) and safety (by avoiding novel failure modes; Johnson, 2022). Explainability will facilitate robustness (by making it easier to human users to intervene in transparent processes) and cooperation (by communicating its reasoning in a way that is checkable by counterparties). Cooperation will facilitate explainability (by using accurate theory-of-mind about its users) and safety (by collaboratively implementing values shared within dyads, organizations, and societies).

Wise reasoning, therefore, can lead to a virtuous cycle in AI agents, just as it does in humans. We may not know precisely what form wisdom in AI will take but it must surely be preferable to folly.

Defining wisdom

I also found their definition of wisdom quite clarifying. They begin by defining it as follows:

Though wisdom can mean many things, for this Perspective we define wisdom functionally as the ability to successfully navigate intractable problems— those that do not lend themselves to analytic techniques due to unlearnable probability distributions or incommensurable values

They argue:

If life were a series of textbook problems, we would not need to be wise. There would be a correct answer, the requisite information for calculating it would be available, and natural selection would have ruthlessly driven humans to find those answers

They list a number of specific types of intractability: incommensurability of values or goals, values changing over time, radical uncertainty[1], chaos[2], non-stationary generating processes[3], examples that are out-of-distribution, computational explosivity[4].

Next, they note that navigating intractable problems can be achieved through two different types of strategies:

(1) Task-level strategies are used to manage the problem itself (e.g., simple rules-of-thumb);

(2) Metacognitive strategies are used to flexibly manage those task-level strategies (e.g., understanding the limits of one’s knowledge and integrating multiple perspectives).

They then argue that although AI has made lots of progress with task-level strategies, it often neglects metacognitive strategies[5]:

For example, they struggle to understand their goals (“mission awareness;” Li et al., 2024), exhibit overconfidence (Cash et al., 2024), and fail to appreciate the limits of their capabilities and context (e.g., stating they can access real-time information or take actions in the physical world; Li et al., 2024). These failures appear to be symptoms of a broader metacognitive myopia, which leads GenAI models to unnecessarily repeat themselves, poorly evaluate the quality of information sources, and overweigh raw data over more subtle cues to accuracy (Scholten et al., 2024).

Given this neglect, they focus on elucidating metacognitive strategies. The paper also identifies six specific metacognitive processes, including “perspective seeking”.

Benefits:

They argue that wise AI offers many benefits:

• Robustness: They argue that metacognition would lead AIs to reject strategies that produce “wildly discrepant results on different occasions”, help them identify biases, and improve their ability to adapt to new environments.
• Explainability: They believe that metacognition would allow the AI to explain its decisions[6].
• Co-operation: They argue “wise metacognition is required to effectively manage these task-level mechanisms for social understanding, communication and commitment, which may be one factor underlying the empirical observation that wise people tend to act more prosocially”. They also argue that wisdom could enable the design of structures (such as constitutions, markets, organisations) that enhance co-operation in society.
• Safety: They note the difficulty of “exhaustively specify[ing] goals in advance”[7] and suggest that wisdom could help AIs emulate the human strategy of navigating goal hierarchies. They also argue that the greatest current risk is systems not working well, and that machine metacognition helps here; in particular, “AIs with appropriately calibrated confidence can target the most likely safety risks; appropriate self-models would help AIs to anticipate potential failures; and continual monitoring of its performance would facilitate recognition of high-risk moments and permit learning from experience.”
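
To make “appropriately calibrated confidence” a little more concrete, here is a minimal sketch of one common way to measure calibration, expected calibration error (ECE). This is my own illustration and not a method the paper proposes; the function name, the equal-width binning, and the toy data are all assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |accuracy - mean confidence| over equal-width bins.

    confidences: stated probabilities in [0, 1] that each answer is right.
    correct:     booleans saying whether each answer actually was right.
    Both the binning scheme and n_bins are illustrative assumptions.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


# Toy data (made up): both systems get some answers wrong,
# but only the first one claims more confidence than its accuracy supports.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(overconfident > calibrated)
```

A lower score means stated confidence tracks actual accuracy more closely. It says nothing about the other two items in the quote (self-models and continual monitoring).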

Comparison to Alignment:

They identify three main problems for alignment:

  1. Humans don’t uniformly prioritise following norms[8]

  2. Norms vary sharply across cultures

  3. Even if norms were uniform, they may not be morally correct

They then write:

Given these conceptual problems, alignment may not be a feasible or even desirable engineering goal. The fundamental challenge is how AI agents can live among us—and for this, implementing wise AI reasoning may be a more promising approach. Aligning AI systems to the right metacognitive strategies rather than to the “right” values might be both conceptually cleaner and more practically feasible. For example, task-level strategies may include heuristics such as a bias toward inaction: When in doubt about whether a candidate action could produce harm according to one of several possibly conflicting human norms, by default do not execute the action. Yet wise metacognitive monitoring and control will be crucial for regulating such task-level strategies. In the ‘inaction bias’ strategy, for example, a requirement is to learn what those conflicting perspectives are and to avoid overconfidence.
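
As a rough illustration of the ‘inaction bias’ heuristic in the quote above, here is a minimal Python sketch. Everything in it is my own assumption and not something from the paper: the `NormJudgment` type, the `should_act` function, and both threshold values are hypothetical. The task-level rule is “do not act if any consulted norm suggests plausible harm”. The metacognitive part is represented very crudely, by also refusing to act when too few perspectives have been consulted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormJudgment:
    """One perspective's estimate that an action causes harm (hypothetical type)."""
    perspective: str
    harm_probability: float  # in [0, 1]


def should_act(judgments, harm_threshold=0.2, min_perspectives=2):
    """Return True only if acting looks safe under every consulted perspective.

    Task-level rule: when in doubt, do nothing.
    Crude metacognitive check: having consulted too few perspectives is
    itself treated as a reason for doubt. Both thresholds are arbitrary.
    """
    if len(judgments) < min_perspectives:
        return False  # not enough perspectives consulted to be confident
    worst_case = max(j.harm_probability for j in judgments)
    return worst_case < harm_threshold


# Made-up example: one norm sees little risk, another sees substantial risk.
conflicting = [NormJudgment("norm A", 0.05), NormJudgment("norm B", 0.4)]
print(should_act(conflicting))  # False: the worst case exceeds the threshold
```

The sketch takes the harm estimates as given. The harder problem the authors point to is learning what the conflicting perspectives are and how much to trust one's own estimates of them.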

Building Wise AI:

Section 4.1 discusses the potential for benchmarking AI wisdom. They seem to favour starting with tasks that measure wise reasoning in humans and scoring the AI’s reflections against predefined criteria[9]. That said, whilst they see benchmarking as a “crucial start”, they also assert that “there is no substitute for interaction with the real world”. This leads them to suggest a slow rollout, giving us time to evaluate whether the AI’s decisions really were wise.

They also suggest two possibilities for training wise models:

One possibility is a two-step process, first training models for wise strategy selection directly (e.g., to correctly identify when to be intellectually humble) and then training them to use those strategies correctly (e.g., to carry out intellectual humble behavior). A second possibility may be to evaluate whether models are able to plausibly explain their metacognitive strategies in benchmark cases, and then simultaneously train strategies and outputs (e.g., training the model to identify the situation as one that calls for intellectual humility and to reason accordingly; e.g., Lampinen et al., 2022). In either case, models could be trained against what a wise human would do, or perhaps to explain and defend its choices to wise humans robustly (i.e., to stand up to ‘cross-examination’).

One worry I have is that sometimes wisdom involves just knowing what to do without being able to explain it. In other words, wisdom often involves system 1 rather than system 2.

Justification for Building Wise AI

First, it is not clear what the alternative is. Compared to halting all progress on AI, building wise AI may introduce added risks alongside added benefits. But compared to the status quo—advancing task-level capabilities at a breakneck pace with little effort to develop wise metacognition—the attempt to make machines intellectually humble, context-sensitive, and adept at balancing viewpoints seems clearly preferable.

The authors seem to imagine wise AIs acting directly in the world. In contrast, my primary interest is in wise AI advisors working in concert with humans.

What else does the paper include?

• Page 5 contains a summary of different theories of human wisdom and two attempts to identify common themes or processes

• Section 2.2.1 discusses how wisdom in AI might vary from wisdom in humans given that AI has differing cognitive constraints

• In the final section they suggest that building machines wiser than humans might prevent instrumental convergence[10] as “empirically, humans with wise metacognition show greater orientation toward the common good”. I have to admit skepticism as I believe in the orthogonality thesis and I see no reason to believe it wouldn’t apply to wisdom as well. That said, there may be value in nudging an AI towards being wise in terms of improving alignment, even if it is far from a complete solution.

  1. ^

    They seem to be pointing towards Knightian Uncertainty.

  2. ^

    Non-linearity or strong sensitivity to starting conditions.

  3. ^

    Such that there isn’t a constant probability distribution to learn.

  4. ^

    They essentially mean intractability.

  5. ^

    They provide some examples at the beginning of section 2 which help justify their focus on metacognition. For example: “Willa’s children are bitterly arguing about money. Willa draws on her life experience to show them why they should instead compromise in the short term and prioritize their sibling relationship in the long term”. Whilst this might not initially appear related to metacognition, I suspect that the authors see this as related to “perspective seeking”, one of the six metacognitive processes they highlight.

  6. ^

    I agree that metacognition seems important for explainability, but my intuition is that wise decisions are often challenging or even impossible to make legible. See Tentatively against making AIs ‘wise’, which won a runner-up prize in the AI Impacts Essay competition on the Automation of Wisdom and Philosophy.

  7. ^

    Eliezer Yudkowsky’s view seems to be that this specification pretty much has to be exhaustive, though others are less pessimistic about partial alignment.

  8. ^

    The first sentence of this section reads “First, humans are not even aligned with each other”. This is confusing since the second paragraph seems to suggest that their point is more about humans not always following norms, which is what I’ve summarised their point as.

  9. ^

    I’m skeptical that using pre-defined criteria is a good way of measuring wisdom.

  10. ^

    This paper doesn’t use the term “instrumental convergence”, so this statement involves a slight bit of interpretation on my part.