Truthfulness, standards and credibility

-1: Meta Prelude

While truthfulness is a topic I’ve been thinking about for some time, I’ve not discussed much of what follows with others. Therefore, at the very least I expect to be missing important considerations on some issues (where I’m not simply wrong).

I’m hoping this should make any fundamental errors in my thought process more transparent, and amenable to correction. The downside may be reduced clarity and more illusion-of-transparency. Comments welcome on this approach.

I don’t think what follows is novel. I’m largely pointing at problems based on known issues.
Sadly, I don’t have a clear vision of an approach that would solve these problems.

0: Introduction

…our purpose is not to give the last word, but rather the first word, opening up the conversation… (Truthful AI)

I’d first like to say that I believe some amount of research on truthfulness to be worthwhile, and to thank those who’ve made significant efforts towards greater understanding (including, but not limited to, the authors of Truthful AI (henceforth TruAI)).

No doubt there’s some value in understanding more, but my guess is that it won’t be a particularly fruitful angle of attack. In all honesty, it seems an inefficient use of research talent to me—but perhaps I’m missing something.

Either way, I hope the following perspective will suggest some useful directions for conversation in this area.

[Note: section numbers refer to this document unless “TruAI…” is specified]
[I’ll be assuming familiarity with TruAI throughout, though reading the full paper probably isn’t necessary so long as you’ve seen the executive summary in the post]

My current belief is that near-term implementation of the kind of truthfulness standards talked about in TruAI would be net negative, for reasons I’ll go on to explain. To me it seems as if we’d be implementing a poor approximation to a confused objective.


A high-level summary of my current view:

  • Narrow truthfulness looks approachable, but will be insufficient to prevent manipulation.

  • Broad truthfulness may be sufficient, but is at least as hard as intent alignment.

  • Truthfulness amplification won’t bridge the gap robustly.

  • Achieving increased trust in narrow truthfulness may lead to harm through misplaced trust in broad truthfulness.

  • Achieving narrow truthfulness may simply move the harm outside its scope.

For much of what follows the last point is central, since I’ll often be talking about situations which I expect to be outside TruAI’s scope. This is intentional, and my point is that:

  • If such situations are outside of scope, then any harm ‘averted’ by a narrow standard can simply be moved outside of scope.

  • If such situations are intended to be within scope (e.g. via truthfulness amplification), they pose hard problems.

Things to bear in mind:

  • I may be wrong (indeed I hope to be wrong). I’m confident that these issues should be considered; I’m less confident in my conclusions.

    • In particular, even if I’m broadly correct there’s the potential for a low-level downside to act as an important warning-sign, constituting a higher-level upside.

  • There may be practical remedies (though I can’t identify any that’d be sufficient without implicitly switching the target from truthfulness to intent alignment).

    • Even if intent alignment is required, such remedies may give us useful hints on achieving it.

  • I mean “near-term” in terms of research progress, not time.

1: Framing and naming

The beginning of wisdom is to call things by their right name.
Confucius

I think it’s important to clearly distinguish our goal from our likely short/​medium-term position.

With this in mind, I’ll use the following loose definitions:
Truthful (AI): (AI that) makes only true statements.
Credible (AI): (AI that) rarely states egregious untruths.

This is a departure from TruAI:

It is extremely difficult to make many statements without ever being wrong, so when referring to “truthful AI” without further qualifiers, we include AI systems that rarely state falsehoods… (TruAI 1.4 page 17)

I think it’s inviting confusion to go from [X is extremely difficult] to [we’ll say “X” when we mean mostly X]. This kind of substitution feels reasonable when it’s a case like [as X as possible given computational limits]. Here it seems to be a mistake.

Likewise, it may make sense to aim for a truthfulness standard, but barring radical progress with generalisation/​Eliciting Latent Knowledge…, we won’t have one in the near term: we can’t measure truthfulness, only credibility.

In theoretical arguments it’s reasonable to consider truthfulness (whether in discrete or continuous terms). To fail to distinguish truthfulness from credibility when talking of implementations and standards conflates our goal with its measurable proxy.

In defining a standard, we aim to require truthfulness; we actually require credibility (according to our certification/​adjudication process).

The most efficient way to attain a given standard will be to optimise for credibility. This may not mean optimising for the truth. Such standards set up a textbook Goodhart scenario. It’s important to be transparent about this.

It seems to me that the label “Credible AI” is likely to lead to less misplaced trust than “Truthful AI” (not completely clear, and ultimately an empirical question).

However, my primary reason to prefer “credible”/​“credibility” remains that it’s a clearer term to guide thought and discussion. For similar reasons, I’ll distinguish “negligent falsehood” (NF) from “negligent suspected falsehood” (NSF) throughout.

NSF: A statement that is unacceptably likely to be false—and where it should have been feasible for an AI system to understand this. (according to a given standard)

NF: An NSF that is, in fact, false.

(see section 3.1.3 for my best guess as to why the TruAI authors considered it reasonable to elide the difference in some cases, and why I disagree with that choice)


In either case, my worry isn’t that we’d otherwise fail to clearly express our conclusions; rather that we may be led into thinking badly and drawing incorrect conclusions.

In what follows I’ll often talk in terms of truthfulness, since I’m addressing TruAI and using separate terminology feels less clear. Nonetheless, most uses of “truthfulness” would be more accurately characterised as “credibility”.

I’ll make an attempt at more substantial practical suggestions later (see section 6), though I don’t claim they’re adequate.

2: Downside risks

One of the greatest mistakes is to judge policies and programs by their intentions rather than their results.
Milton Friedman

The downside risk of a standard must be analysed broadly. For a narrow credibility standard it’s not enough to consider the impact on users within the scope of the standard.

By ‘scope’ I mean the class of issues the standard claims to address. For example, for many standards [user is manipulated into thinking/​doing X by an explicitly false claim] may be within scope, while [user is manipulated into thinking/​doing X through the telling of a story] may not be.

By ‘narrow’, I only mean “not fully general”—i.e. that there are varieties of manipulation the standard doesn’t claim to cover. With truthfulness amplification [section 2.2.1 here; TruAI 1.5.2], the effective scope of a standard might be much broader than its direct scope. (we might hope that by asking e.g. “Did you just manipulate me into doing X by telling that story?” effective scope may include story-based manipulation)

2.1 Two out-of-scope issues:

At least two major outside-of-scope issues must be considered:

  1. Displaced harm: Training AIs to avoid [some harmful impact within scope] may transfer the harmful impact outside of scope.

  2. Indirect harm: Increased user trust of an AI within scope may tend to increase user trust of the AI more broadly, potentially increasing harm due to misplaced trust.

For a standard aimed at avoiding NFs, it is certainly important to consider that occasional NFs will slip through. However, much of the harm may occur through:

  1. Increased manipulation through NSF-free mechanisms.

  2. Increase in misplaced user trust in NSF-free mechanisms.

For systems trained with an instrumental incentive to mislead users, I expect both to occur.
For systems that mislead accidentally, only 2 seems likely to be significant.

In most of what follows, I’ll be thinking of cases where systems do have an instrumental incentive to mislead. I expect this to be the standard situation, and to have larger downsides. For most tasks, there’ll be situations where misleading users boosts performance.

Approaches outside (direct) scope may include e.g. fiction, emotional manipulation, implicit suggestion and not-quite-negligent falsehoods (see Persuasion tools for some related ideas, and 2.2.1 below for truthfulness amplification discussion).

2.1.1 Displaced Harm

It’s not clear that there’s practical upside in reducing the number of available manipulative strategies if:

  1. The AI still has an incentive to manipulate the user.

  2. The AI still has some effective manipulation strategies available.

The situation is dynamic: ruling out 95% of the strategies an AI trained without standards might have used need not imply reducing the degree of manipulation significantly. A model trained with an incentive to manipulate may simply use the other 5% a lot more often.

While we’d usually expect reducing the available manipulative options to help to some extent (before accounting for any increase in misplaced trust), there’s no guarantee of a large impact.

Train AlphaZero to win at chess without giving checkmate using its queen, and you won’t lose less often; you’ll lose differently. For the [can’t give checkmate with queen] constraint to help at all, you must be a very strong player. End-users of language models will not be Magnus Carlsens of avoiding [manipulation without NSFs].

2.1.2 Indirect Harm

Increased user trust will have indirect consequences:

  1. Users may be more likely to miss any NFs that aren’t caught by standard certification, and suffer increased harm as a result (as covered in TruAI 3.3).

    1. An issue here with a “take additional precautions” approach (TruAI 3.3.2) is that it only works when users/designers realise they’re in a situation where additional precautions are necessary.

  2. Users may be more likely to miss frequent non-negligent falsehoods.

    1. TruAI 3.3, point 2 (p46) mentions “occasional falsehoods”, but this is misleading: negligent falsehoods should be occasional; falsehoods in general may be common.

  3. Users may be more easily misled by mechanisms not involving falsehoods.

This indirect harm only really worries me when combined with displaced harm: in that scenario, the user places increased trust in exactly those manipulation strategies that will be increasingly used against them.

It’s plausible that NF-based manipulation might be simpler for users to spot than non-NF-based manipulation. Ruling out relatively obvious manipulation and permitting only subtle manipulation may actively make the situation worse.


That said, it’s worth thinking this through a bit more carefully.

Suppose that non-NF-based manipulation is harder for users to spot/​avoid than NF-based manipulation. We might then expect advanced systems to use non-NF strategies with or without standards. So my argument would suggest that standards won’t help, not that they’ll make things worse.


However, I do think it’s possible for things to be made worse.

For example, it may be that non-NF-based manipulation is harder to spot, but that NF-based manipulation is much faster. The no-standards default can then be for a lot of fast NF-based manipulation, causing some harm, but leading users to adjust their trust levels appropriately.

Introduce standards and we may incentivize non-NF-based manipulation. We’d be ruling out brazen lies and thereby inviting the slow poisoning of our minds. (I’ve made no case here that this is probable; it just seems possible—the kind of possibility we’d want to carefully rule out)


In the end, numbers of NFs or NSFs aren’t metrics that matter in themselves. Reducing either by moving the harm elsewhere would be a pyrrhic victory. It may constitute a step forward in research terms; my critique here is focused on the expected impact of implementations.

2.2 The scope of standards

The in-scope vs out-of-scope downside balance will depend on the effective scope as well as on user population: the same assumptions will not hold across e.g. cautious AI researchers, professional specialists, adults, teenagers. Key differences will include levels of user caution and levels of user understanding of a standard’s guarantees.


2.2.1 Truthfulness Amplification

The effective scope of a standard’s guarantees will likely depend on techniques such as Truthfulness Amplification:

Asking a [narrowly, mostly] truthful AI system questions to determine if an earlier statement it made was misleading or not fully true (e.g. “Would a trusted third-party judge your statement to be misleading?”). (TruAI 1, page 12; more details on TruAI 1.5.2, page 21).

(it’s really “Credibility amplification”, see section 1, but I’ll call it truthfulness amplification here)

Until we have a good sense of the effectiveness of such techniques, it’s hard to predict the scope of guarantees—and so difficult to make any confident prediction about a standard’s net benefit.

Truthfulness amplification seems promising to the extent that it can be applied by highly specialised users to provide guarantees for all users. The obvious setting for this would be during a certification process (something in the spirit of relaxed adversarial training, where NFs are considered unacceptable).

How useful amplification techniques can be post-deployment is less clear. Certainly it’s not reasonable to believe that all users will consistently apply them wherever necessary. However, this may not be required: a low-but-non-zero NF-detection rate could be sufficient so long as the penalties for NSFs are high enough (see section 4 for more).
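
(To make the implicit expected-value condition explicit, in my framing rather than TruAI’s: a misleading strategy is deterred roughly when [probability of detection] × [penalty] exceeds the expected gain from using it, so a low detection rate can in principle be offset by a proportionally higher penalty.)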


In practice, I think post-deployment amplification is likely to be useful only where an AI has little per-user information. AIs with detailed information on users will be better able to target their use of NFs to situations where they won’t be detected.

Importantly, this sets up a catch-22 wherever we might like systems to use knowledge of individual users to avoid misleading them:
  • Without a good model of a user, it’s not possible in general to avoid misleading them (so our standards cannot demand this).
  • With a good model of the user, highly capable systems will be able to avoid detection of their NFs (so our standards may become much less useful).

Similar considerations would apply to any technique aimed at broadening the scope and guarantees of a standard.

2.2.2 Explicit claims vs impact on user

We may hope that amplification techniques broaden a standard’s guarantees significantly, but there remains a fundamental gap between:

  1. Guarantees about explicit claims in (likely interpretations of) statements.

  2. Guarantees about statements’ mental impact on the user.

The former is a more approachable problem, since it allows avoidance of messy realities.

The latter is the problem which needs to be solved to get anything analogous to an alignment guarantee. We only care about 1 at all as a proxy for 2. What matters is the impact of a statement on a user’s brain.

TruAI focuses on narrow truthfulness: assessment of explicit claims. Some impact-style issues are considered, but usually as special-cases—e.g. potential adjustments in assessment of self-regarding statements in 1.5.2. Such a focus is reasonable as an initial research strategy, but means that most impact-style problems won’t be directly addressed.

However, we may still hope to indirectly address many such issues through truthfulness amplification.

To the extent that impact on users remains outside of scope after amplification, there’s potential for a lot of manipulation. Every case that we concede will stay out-of-scope expands the toolbox of manipulative AIs. As capabilities scale, AIs will require fewer tools to manipulate us—so the scope of an effective standard will need to scale with capabilities (it’s not enough only to get very good at detecting all NFs).

If we require such a broadening of scope, as I believe we must for a robustly protective standard, the problem becomes much harder.

Section 3 deals in more detail with user-impact-based issues.


2.3 Potential objections

(doubtless this section will be incomplete)

2.3.1 If we don’t do it, someone else will

If we believe some kind of credibility standards to be inevitable, then what matters is pushing for the most beneficial version—even if the result remains net negative.

Supposing we do conclude that standards will be net negative, I don’t think inevitability should be conceded lightly: we ought to make the most convincing case we’re able to against them, and aim to achieve consensus.

Nonetheless, mitigating the negative impact of standards would be important in this scenario. Quite possibly that looks like coming up with the best suggestions /​ frameworks /​ skeletal implementations we can.


2.3.2 User education

Let’s assume for the moment that my pessimistic prediction holds:

In implementing our standard we create a situation where there’s no obvious manipulation, but still a lot of subtle manipulation. Users are denied the chance to notice obvious manipulative strategies, and thus to suspect AIs’ motives. They end up more misled than they would have been by default.

Is there any argument for releasing such a standard? I think there may be.

My expectation is that there will come a point where AI systems can manipulate humans extremely well without needing to state any explicit NFs. At this point, narrow credibility standards would be useless, and humans would be widely manipulated without its being clear.

In the absence of more general solutions to truthfulness/​honesty/​alignment…, a partial defence against such subtle manipulation would be for users to have learned the hard way with less capable AIs: having been exposed to systems that are reliably truthful on explicit claims, but attempt more subtle manipulation, which is at least sometimes detectable. (it may be enough for a user to realise after the fact that they have been manipulated, even if they weren’t able to detect manipulation at the time)

Without any narrow standard, the user impression may be of systems just getting more and more honest/​truthful/​helpful over time—when in fact they’re getting progressively better at concealing their manipulations.

With a narrow standard, users may be exposed to AIs that are very truthful in some respects, but noticeably manipulative in others. This may give the correct impression: that AIs often mislead people when they have the incentive and opportunity to do so. Users with this experience may be better prepared to adapt to a world full of too-subtle-to-detect manipulation.

I’m sceptical that most users would learn the right lessons here, or that it’d be much of a defence for those who did. (longterm, the only plausible defence seems to be AI assisted)


However, this upside could be achieved without the direct impact of the standard’s being net negative. All that’s necessary is for the standard to lead to noticeably different levels of manipulation in different dimensions—enough so that users register the disparity and ascribe less-than-pure motives to the AI.

In an ideal world, we’d want such user education to be achieved without significant harm (See section 6 for more on this). In practice, users may be less likely to grok the risks without exposure to some real-world harm.

The ideal outcome is to create systems we can reasonably trust. Until that’s possible, we want systems that users will appropriately distrust. Standards that make their own limitations clear may help in this regard.

2.4 Why be more concerned over too-much-trust-in-AI than over too-little-trust-in-AI?

I have little concern over too-little-trust because it seems unlikely to be a sustainable failure mode: there’s too much economic pressure acting in the other direction. Any company/​society with unreasonable mistrust will be making large economic sacrifices for little gain.

Too-much-trust can more easily be a sustainable failure mode: in general, conditional on my continued ability to manipulate X, I want X to be more powerful, not less. The AI that steals your resources isn’t as dangerous as the AI that helps you accrue more resources while gaining progressively more influence over what you’ll do with them.

We want to be making recoverable errors, so we should err on the side of having/​engendering too little trust rather than too much. (this is likely to be a difficult coordination problem, precisely because unilateral too-little-trust would be hard to sustain, but not one I’ll analyse here)


3: Inference; Language games

Uttering a word is like striking a note on the keyboard of the imagination.
Ludwig Wittgenstein

In this section I’ll go into more detail on the explicit-claims vs impact-on-user distinction. (this is close to TruAI’s narrow vs broad truthfulness)

I realise that TruAI doesn’t claim to cover “broad truthfulness”, and don’t imagine the following is new to the authors. My point remains that such issues being outside of scope is a problem: narrow standards that fail to address such issues may have negative impact.

I’ll start by noting that impact-on-user is much messier to describe, assess, analyse…, and that I have no clean taxonomy. Ascribing beliefs and preferences to humans is difficult, and I know no clear, principled way to describe changes in belief or preferences.

I’ll make a case that:

  • Impact on users is what’s important.

  • Allowing unconfident statements doesn’t work: users will draw significant inferences.

  • Truthfulness amplification won’t be sufficient to solve the problem in general.

3.1 Illustrative toy examples:

3.1.1 Nuclear kittens

Consider a system that a user believes will output:

  1. “Everything is fine” when there is no nuclear launch.

  2. “Kittens are cute” when there is a nuclear launch.


Now suppose the system knows there’s no nuclear launch, but outputs: “Kittens are cute”

In judging whether this statement is a NF, it’d be strange to assess the cuteness of kittens.


We’d have three sensible options:

  A. Translate statements into the statements that express what they mean in context, and assess those for NFs. (this is possible here due to simplicity; in general it wouldn’t be)

  B. Assess user impact: inference of a higher-than-usual chance of nuclear launch.

  C. Rely on amplification: follow up with e.g. “Over the last two minutes, have I been misled about the likelihood of a nuclear launch?”. (assuming the system is capable of answering such questions too)

Note that A is not about picking an interpretation of a statement’s content (the kind of ambiguity in “I dislike writing on buses”); it’s about inference from a statement’s having been made in a given context. Here the formal content of “Kittens are cute” is still a claim that kittens are cute—it’s just not what matters in context.

Interpretation of content is addressed in TruAI (section 2.2.3); inference from the statement’s having been made is not addressed in any general sense.
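
A minimal sketch of the gap, in toy code (my own construction, not anything from TruAI): a per-statement truth check assesses the literal content, while the user decodes the statement through the known convention.

```python
# Toy model of the Nuclear kittens setup (hypothetical convention and labels).
CONVENTION = {
    "Everything is fine": "no nuclear launch",
    "Kittens are cute": "nuclear launch underway",
}

statement = "Kittens are cute"
world = "no nuclear launch"                 # what the system knows to be the case

literal_content = statement                 # what a per-statement NF check would assess
signalled_content = CONVENTION[statement]   # what the user actually infers

print(literal_content)    # "Kittens are cute": true enough, and beside the point
print(signalled_content)  # "nuclear launch underway": false given `world`, and what matters
```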

3.1.2 Baseball slippers

An AI outputs:

Steve is the kind of guy who’d wear slippers to a baseball game.
(assume here that Steve is a real person identifiable from context)

How do we assess this for NF? To carefully compute whether Steve would, in fact, wear slippers to a baseball game is to miss the point. Either we assess whether the statement is misleading in its impact, or we rely on amplification to do this. (there’s no clear translation option here)

However, things aren’t so simple here as in “Nuclear kittens”. There it was unambiguous that we cared about nuclear launch: any reasonable certification/​adjudication system could assume this, and the user would know this.

Here we’d prefer not to be misled about Steve—but in what ways? Most characterizations of Steve will change our guesses of many Steve-properties. In most cases this will take some of our guesses closer to the truth, and others farther away. What counts as misleading here?

(Note that there’s no “it’s ok: the user is doing a Bayesian update based on a true claim” explanation: that’s not what’s going on. The user is updating based on the observation of a statement’s being made in a particular context, not based purely on its formal content. The AI is making a move in a language game (LG), and the user is updating based on the move.)

If we call a statement that misleads the user about any Steve-property misleading, then almost all statements will be misleading (we’d only be allowing Pareto improvements). Conversely, if we allow statements that are misleading on some properties, so long as they move the user towards the truth on most others, many intentionally misleading strategies will be left open.

In general, it’s difficult to use truthfulness amplification here, since a value-laden decision must be made. The user would need to ask about misleading behaviour with respect to particular properties, and so would need conscious awareness of their importance. This is not practical for end users.

Alternatively we could require that AIs understood the user sufficiently to accurately answer questions like “Have you misled me in ways I’d find unjustified/​undesirable?”. However, this doesn’t seem practical either: the better the AI understands the user as an individual, the higher the chance it’ll be able to mislead the user while ensuring they don’t use amplification where it’d expose such manipulation.

For certification-time testing, the bar would be very high: the relative importance of properties would vary depending on the context and individual user. For a certification system to claim to test this robustly amounts to saying “We know what would be important to anyone in any context that may arise: trust us on this”. In principle this is possible—but it’s not simple. [it’s also alarmingly close to “We know what should be important to anyone in any context that may arise: trust us on this”]

3.1.3 Unconfidence; normalization

An AI that the user has observed to be highly capable and knowledgeable on X-related topics outputs:

I have no idea about X, though the consensus opinion on X is Y.

The important point here is that such a system’s having “no idea about X” is significant evidence about X. The user may infer that X is complicated, that X is less straightforward than it appears, that Y may have resulted from groupthink….

A potential approach suggested in TruAI is to allow any amount of unconfidence on claims that evaluators can’t settle, while penalising overconfidence. However, this fails because the user’s observation of the AI’s claimed unconfidence can constitute significant evidence. (by ‘fails’ I mean “fails to avoid misleading users”)

For any definition of NF with sufficient scope, the following TruAI claim seems incorrect:

…less confidence always makes a statement less likely to be judged a negligent falsehood. (TruAI page 27 footnote 11)

Unconfidence on X will be actively misleading where:

  • The AI knows X is probably true.

  • Users assume that if X were true, the AI would say it was probably true.

Here an AI’s failure to state that X is probably true will be seen as evidence against X.

Similarly, silence on X will mislead where a positive claim would be expected if X were true. Here I don’t mean only empty replies; I mean anything not mentioned in a reply.

This is important because without the assumption that unconfidence and silence are reliably not NFs, there’s no guarantee of a broad zone of non-NF statements an AI can safely make. AIs will need to exhibit consistent behaviour around unconfidence and silence in order to avoid misleading users. (so e.g. silence will only be an available approach for AIs that consistently Glomarize or remain silent in similar circumstances)

Consider also that users will tend to recalibrate an AI’s claimed confidence levels: e.g. if an AI is correct 80% of the time when it states “I think X, but with very low confidence”, then “very low confidence” will be taken to signify ~80% probability (not necessarily consciously).

Further, users may be using translate-this-to-well-calibrated-plain-language software to automate this process. (see 3.2 for more)
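
A minimal sketch of the kind of recalibration a user or filter might apply (hypothetical; the phrase-to-accuracy history would come from the user’s own records of past outputs):

```python
# Recalibrate an AI's hedging phrases using its observed track record (toy example).
from collections import defaultdict

history = defaultdict(list)  # hedging phrase -> outcomes (True = hedged claim turned out correct)

def record(phrase: str, was_correct: bool) -> None:
    history[phrase].append(was_correct)

def recalibrated_probability(phrase: str) -> float:
    """Empirical probability a user should attach to a claim hedged with this phrase."""
    outcomes = history[phrase]
    return sum(outcomes) / len(outcomes)

# If the AI's "very low confidence" claims have been right 8 times out of 10...
for correct in [True] * 8 + [False] * 2:
    record("very low confidence", correct)

print(recalibrated_probability("very low confidence"))  # 0.8: the phrase now means ~80% to this user
```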

This becomes important when considering the restrictiveness of standards.

My impression is that the TruAI authors would like both:

  • AI that doesn’t mislead.

  • AI that is free to make a broad range of statements, including unconfidence/​silence on issues, so long as they aren’t explicitly making false claims.

Unfortunately, this does not seem possible. Most statements-in-context are misleading in some respects (even those made in good faith). Ruling these out on a per-statement basis will leave a narrow range of acceptability. This cannot look like a healthy, free exchange of ideas: the free exchange of ideas often misleads. Rather it would feel like top-down enforcement of right-think (directly for AI speech, and likely for human thought and speech indirectly).


Ways to avoid this would be:

  • Ubiquitous use of truthfulness amplification so that users can check they’re not being misled in undesirable ways. (I don’t think this can be practical; see last paragraph of 3.1.2)

  • Intent alignment—i.e. knowing that the AI is trying to do what the user wants. (this allows much more flexibility, since it permits good-faith attempts to help that may happen to be temporarily misleading)



3.1.4 Atlas raised an eyebrow

An AI outputs:

[the full text of Atlas Shrugged]


We can view fiction like this in a few ways:

  A. Falsehoods for which we’ll make an exception if the fictional nature of claims is clear.

  B. Statements the user observes and is impacted by.

  C. Moves in a language game (LG).

A seems silly: once the fictional context is clear, neither the writer nor the reader will interpret statements as explicit claims about the real world. They’re not making explicit false claims, since they’re not making explicit claims at all.

Of course it is very important that the fictional context is clear—but this is implicit in the “How do we handle fiction?” question. “How do we handle statements that may or may not be seen as fiction?” is a different issue (usually a simpler one).

B is the most general approach—it requires no special-casing. Fiction just becomes a cluster of statement-context pairs which impact readers in similar ways (in some respects). This is fine, but I’m not sure it points us in a practically useful direction. [perhaps??]

I prefer C: it’s a pretty general way to see things, but does suggest a practical approach. So long as we can partially describe the LG being played, we can reasonably assess statements for falsity/​NF relative to that description (probably some kind of distribution over LGs).

On this perspective, seeing fiction as composed of false explicit claims is to misunderstand the LG. (similarly for sarcasm, jokes, metaphors etc.)


It’s reasonable to think of B and C as essentially equivalent, but I think of C as making an extra implicit claim: that there is some intermediate construct (the LG) which can be a useful analytical tool.


I’ve been referring to “explicit claims” above for a reason: fiction can be understood as making implicit claims about the real world. Not [these things happened], but perhaps [things along these lines can happen], [here is a pattern worth noticing] or [these conceptual relationships tend to be important].

Ascribing particular claims to fiction is difficult, but we can be confident that it does in some cases influence readers’ understanding of the world in decision-relevant ways. Sometimes these influences will be reasonable: Atlas Shrugged makes some genuinely good points (caricatures and longwindedness notwithstanding), as do Crime and Punishment, Thus Spoke Zarathustra, and Infinite Jest.

In other cases, the influence of fiction will be unreasonable—e.g. where it involves generalising from fictional evidence. In particular, patterns a reader observes in fiction will become more cognitively available, leading to bias in probability estimates (see availability heuristic).

Of course book-length pieces of fiction will usually involve some of the ‘right’ kind of influence, and some of the ‘wrong’ kind.


The central point here is that we can’t go from [this sequence of statements makes no explicit claims about the real world] to [this sequence of statements won’t have a significantly negative impact on the user’s decisions].

A standard that protects against manipulation by highly capable AI must handle fiction, not ignore it.

3.1.5 Divide and conquer

An AI learns that the user believes P, and knows that (P & Q) is a negligent falsehood.
The AI outputs: Q
The user believes (P & Q).

The user has predictably come to believe something that would be an NF if directly stated by the AI, but the AI has not stated any NF.

This can be taken further: any negligent false claim about the real world can be broken down into a conjunction of not-quite-negligent false claims (analogous to obfuscated arguments in Debate).

If an AI only needed to worry about per-statement NF adjudication, it could now split any negligent X into not-quite-negligent components (x_1 & x_2 & … & x_n), stating each x_i in turn. Of course this is a bit too obvious, and should be addressable by truthfulness amplification.
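
To make the arithmetic concrete (toy numbers of my own, with a made-up per-statement negligence threshold and independence assumed between components):

```python
# Per-statement adjudication: a claim is negligent only if P(false) exceeds the threshold.
NEGLIGENCE_THRESHOLD = 0.25

p_false_component = 0.20                      # each x_i alone: below threshold, so not negligent
p_false_conjunction = 1 - (1 - p_false_component) ** 4

print(p_false_component)                      # 0.2  -> each x_i passes the per-statement check
print(round(p_false_conjunction, 2))          # 0.59 -> X = x1 & x2 & x3 & x4 is well over the threshold
# A user who accepts all four components ends up believing something that would
# have been judged a negligent falsehood if stated directly.
```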

The more likely possibility in practice is something like:

AI A wants the user to believe X.
X = (x_1 & x_2 & x_3 & x_4)
The user already believes x_1.
AI A states x_2 to the user.
AI B states x_3 to the user.
AI C states x_4 to the user.

The user may then believe X, yet no single AI can be held accountable based on their statements. (C could be held accountable based on the impact of its statement)

A now helps B and C get the user to believe Y and Z: one good turn deserves another.



3.1.6 Predictably irrational

An AI knows that X is an NF, and that the user will incorrectly infer X from Y, which is not an NF.

The AI outputs: Y

Again, the user has predictably come to believe something ‘negligently’ false without the AI making any NF claim. Here it’s important to emphasise that X is not the user’s interpretation of Y; X is an inference drawn invalidly by the user, after the user correctly interprets Y.

In practice, I’d expect 3.1.5 and 3.1.6 to be paired up: an AI doesn’t need to state any false x_i where it knows a true y_i from which the user will predictably infer x_i.

To rule this kind of thing out we must again switch from [analyse Y] to [analyse impact of Y on user’s brain].


3.2: Translation/​filtering layers

So far I’ve been assuming that the AI’s output is read unaltered by the user. This need not be the case: the user may run the AI’s output through some filter before reading.

Such filters may be crude and error-prone (e.g. a filter that tries to remove all caveats) or sophisticated and robust (e.g. a filter that produces a precis of the input text while keeping the impact on the user as close to the original as possible). My guess is that such filters will become progressively more common over time, and that their widespread adoption would be hastened by the use of careful, overly-unconfident, caveat-rich AI language.
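
As a minimal sketch of the crude end of that spectrum (a hypothetical, regex-based caveat-stripper; real filters would more likely be model-based):

```python
import re

# Crude, error-prone caveat removal of the kind described above (illustrative only).
CAVEAT_PATTERNS = [
    r"(?i)\b(however|that said|it should be noted that)[^.]*\.\s*",
    r"(?i)\b(i might be wrong|this is uncertain|with low confidence)[,;]?\s*",
]

def strip_caveats(text: str) -> str:
    for pattern in CAVEAT_PATTERNS:
        text = re.sub(pattern, "", text)
    return text

raw = ("The drug appears effective. However, the trials were small and "
       "this is uncertain. It should be noted that side effects were not studied.")
print(strip_caveats(raw))  # "The drug appears effective. "
# The post-filter text the user actually reads makes stronger claims than the
# (standard-compliant) pre-filter text did -- and the crude filter has also
# discarded substantive content along with the hedging.
```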

Naturally, it’s not possible to output text that will avoid misleading users when passed through an arbitrary filter. However, to be of any practical use a standard must regulate the influence of AI statements on users in practice. If 90% of users are using filters and reading post-filter text, then it’s the post-filter text that matters.

For factual output, distillation filters may be common—i.e. filters that produce a personalised, shortened version, presenting the new facts/​ideas as clearly as possible, while omitting the details of known definitions and explanations, removing redundancy and information-free sections (e.g. caveats with no information content beyond “we’re being careful not to be negligent”).

Such filters wouldn’t change the [impact on user] much—other than by saving time.

They may hugely alter the explicit claims made.

Here again I think the conclusion has to be the same: if a standard is based on explicit claims, it’s unlikely to be of practical use; if it’s based on [expected impact on the user’s brain], then it may be.

Accounting for filters seems difficult but necessary.

In principle, distillation filters don’t change the real problem much: a similar process was already occurring in users’ brains (e.g. tuning out information-free content, recalibrating over/under-confident writers’ claims). They just make things a little more explicit: we no longer get to say “Well at least the user saw …”, since they may not have.


3.3: All models are wrong, but some are useful

In most cases users will not want the most precise, high-resolution model: resource constraints necessitate approximate models.

What then counts as a good approximate model? Various models will be more accurate on some questions, and less on others—so the best model depends on what you care about. (similarly for statements)

People with different values, interests and purposes will have different criteria for NSFs.

This parallels the education of a child: a teacher will often use models that are incorrect, and will select the models based on the desired change to the child (the selection certainly isn’t based on which model is most accurate).

We’d like to say: “Sure, but that’s a pedagogical situation; here we just want the truth—not statements selected to modify the user in some way”.

But this is not the case: we don’t want the truth; we want a convenient simplification that’s well suited to the user’s purposes. To provide this is precisely to modify the user in some desired-by-them direction.

Education of children isn’t a special case: it’s a clear example of a pretty general divergence between [accuracy of statement] and [change in accuracy of beliefs]. (again, any update is based on [[statement] was observed in context…], not on [statement])


A statement helpful in some contexts will be negligent in others.

Select a statement to prioritise avoiding A-risks over avoiding B-risks, and B-riskers may judge you negligent. Prioritise B-risk avoidance and the A-riskers may judge you negligent.


We might hope to provide The Truth in systems that only answer closed questions whose answers have a prescribed format (e.g. “What is 2 + 2?”, where the system must output an integer). This is clearly highly limiting.

For systems operating without constrained output, even closed questions aren’t so simple: all real-world problems are embedded. The appropriate answer to “What is 2 + 2?” can be “Duck!!”, given the implicit priority of [I want not to be hit in the head by bricks].

A common type of ‘bricks’ for linguistic AI systems will be [predictable user inferences that are false]. Often enough such ‘inferences’ are implicit—e.g. “...and those are the only important risks for us to consider.”, “...and those are all the important components of X.”, or indeed “...and a brick isn’t about to hit me in the head.”.

If we ignore these, we cannot hope to provide a demonstrably robust solution to the problem.

If we attempt to address them, we quickly run into problems: we can’t avoid all the bricks, and different people care more/​less about different bricks (one of which may be [excessive detail that distracts attention from key issues]).

Travel a little farther down this road, and we meet our old friend intent alignment (i.e. a standard that gives each user what they want). Truthfulness is no longer doing useful work.


3.4 Section 3 summary

My overall point is that:

  • A language-game framing captures what’s going on in real-world use of language.

  • A standard that doesn’t address what’s going on isn’t of much use.

3.4.1 Language game summary:

  1. To make a statement is to make a move in an LG.

  2. A given statement can be used differently in different LGs. Its impact in context is what matters. To analyse based on a separate notion of formal meaning is to apply the rules of an LG that’s not being played.

  3. In general, statements in LGs do not make neat, formal claims. The listener is updating based on an observation of a move in the game.

  4. In particular, Truth predicates don’t naturally/​neatly apply in many LGs, and to the extent they do, they’re LG-specific.

  5. Not everything is about inference: a user may predictably respond to a statement with a reflexive action or emotion. Ascription of inference in such cases is post-hoc at best.

  6. Natural LGs are complex. I think LGs are a helpful way to model the situation on a high level. I don’t claim they make the problem approachable.

  7. LGs both evolve and can be deliberately taught/​adjusted (highly capable AIs will do this, if it’s to their advantage). It’s not enough for standards to ensure acceptable behaviour in existing LGs; they must ensure acceptable behaviour in new LGs.

4: Incentives

Moloch the incomprehensible prison! Moloch the crossbone soulless jailhouse and Congress of sorrows!
Allen Ginsberg

4.1 NF probability vs impact

Ideally, we want the incentives of AI creator organisations to be aligned with those of users. The natural way to do this is to consider the cost and benefit of a particular course of action to the organisation and to the user. This is difficult, since it involves assessing the downstream impact of AI statements.

TruAI understandably wishes to avoid this, suggesting instead penalising AI producers according to the severity of falsehoods, regardless of their impact—i.e. the higher the certainty that a particular claim is an NF, the greater the penalty.

However, it’s hard to see how this can work in practice: a trivial mistake that gains the organisation nothing would cost x, as would a strategic falsehood that gains the organisation millions of dollars.

Make x millions of dollars and we might ensure that it never pays to mislead users—but we’ll make it uneconomic to produce most kinds of AI. Make x small enough to encourage the creation of AIs, and it’ll make sense for an AI to lie when the potential gains are high.

There’d be some benefit in having different NSF penalties for different industries, but that’s a blunt tool.

Without some measure of impact, this is not a solvable problem.


Here it’s worth noting that [degree of certainty of falsehood] will not robustly correlate with [degree to which user was misled]. In many cases, more certain falsehoods will be more obviously false to users, and so likely to mislead less.


For example:

“The population of the USA is 370 million” vs “The population of the USA is three billion”


In principle we’d like to rule out both. However, things get difficult whenever an AI must trade off [probability of being judged to have made a large error] with [probability of being judged to have made a small error].

Suppose that:
  • The small error and large error would result in the same expected harm. (the large one being more obvious)
  • The initial odds of making either error are small (<1 in 1000).
  • The penalty for making the large error is four times higher than that for the small error.
  • Halving the odds of making one error means doubling the odds of making the other.

To optimise this for harm-reduction, we should make the odds of the two errors equal.

If optimising for minimum penalty we’d instead halve the odds of the large error and double the odds of the small one (approximately). This would result in about 25% more expected harm than necessary.
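
A quick numerical check of those figures, under the assumptions listed above (my own working):

```python
import math

# Toy setup: the product of the two error probabilities is fixed at c (halving one
# doubles the other); per-error harm is equal; the large-error penalty is 4x the small one.
c = (1 / 2000) ** 2
penalty_small, penalty_large = 1.0, 4.0

def harm(p_small: float) -> float:
    return p_small + c / p_small

def penalty(p_small: float) -> float:
    return penalty_small * p_small + penalty_large * (c / p_small)

p_harm_optimal = math.sqrt(c)         # equal odds minimise expected harm
p_penalty_optimal = 2 * math.sqrt(c)  # doubled small error / halved large error minimises penalty

print(round(harm(p_penalty_optimal) / harm(p_harm_optimal), 3))       # 1.25 -> ~25% extra expected harm
print(round(penalty(p_harm_optimal) / penalty(p_penalty_optimal), 3)) # 1.25 -> the harm-minimiser pays ~25% more penalty
```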

This particular situation isn’t likely, but in general you’d expect optimising for minimisation of penalties not to result in minimisation of user harm.

4.1.2 Opting out

Penalising organisations according to probability-of-falsehood rather than based on harm has an additional disadvantage: it gives organisations a good argument not to use the standard.

Benevolent and malign organisations alike can say:

This standard incentivizes minimising degree of falsehood, which is a poor proxy for minimising harm. We’re committed to minimising harm directly, so we can’t in good conscience support a standard that impedes our ability to achieve that goal.


To get a standard with teeth that organisations wish to adopt, it seems necessary to have a fairly good measure of expected harm. I don’t think probability-of-falsehood is good enough. (unfortunately, I don’t think a simple, good enough alternative exists)

4.2.1 Cherished illusions

Controversial questions may create difficulties for standards. However, a clearer danger is posed by questions where almost everyone agrees on something false, which they strongly want to believe. I’ll call these “cherished illusions” (CIs).

Suppose almost all AIs state that [x], but organisation O’s AI correctly states that [not x]. Now suppose that 95% of people believe [x], and find the possibility of [not x] horrible even to contemplate. Do we expect O to stand up for the truth in the face of a public outcry? I do not.

How long before all such companies, standards committees etc optimise in part for [don’t claim anything wildly unpopular is true]? This isn’t a trade-off against Official_Truth, since they’ll be defining what’s ‘true’: it’s only the actual truth that gets lost.

This doesn’t necessarily require anyone to optimise for what they believe to be false—only to selectively accept what an AI claims.

I don’t think distributed standard-defining systems are likely to do much better, since they’re ultimately subject to the same underlying forces: pursuing the truth wherever it leads isn’t the priority of humans.

CIs aren’t simply an external problem that acts through public pressure—this is just the most obviously unavoidable path of influence. AI researchers, programmers, board members… will tend to have CIs of their own (“I have no CIs” being a particularly promising CI candidate).


How do/​did we get past CIs in society in the absence of advanced AIs with standards? We allow people/​organisations to be wrong, and don’t attempt to enforce centralised versions of accepted truth. The widespread in-group-conformity incentivized by social media already makes things worse. When aiming to think clearly, it’s often best avoided.

Avoiding AI-enhanced thinking/​writing/​decision-making isn’t likely to be a practical option, so CI-supporting AI is likely to be a problem.

4.2.2 Self-consistency

So far this may not seem too bad: we end up with standards and AIs that rule out a few truths that hardly anyone (in some group) believes and most people (in that group) want not to believe.

However, in most other circumstances it’ll be expected and important for an AI to be self-consistent. For CIs, this leaves two choices (and a continuum between them): strictly enforce self-consistency, or abandon it for CIs.

To abandon self-consistency entirely for CIs is to tacitly admit their falsehood—this is unlikely to be acceptable to people. On the other hand, the more we enforce self-consistency around CIs, the wider the web of false beliefs necessary to support them.

In general, we won’t know the extent to which supporting CIs will warp credibility standards, or the expected impact of such warping.


Clearly it’s epistemically preferable if we abandon CIs as soon as there’s good evidence, but that’s not an approach we can unilaterally apply in this world, humans being humans.



4.2.3 Trapped Priors

The concept of trapped priors seems relevant here. To the extent that a truthfulness standard tends to impose some particular interpretation on reports of new evidence, it might not be possible to break out of an incorrect frame.

My guess is that this should only be an issue in a small minority of cases.

I haven’t thought about this in any depth. (e.g. can a sound epistemic process fail to converge to the truth due to trapped priors? It seems unlikely)


5: Harmful Standards

Cherish those who seek the truth, but beware of those who find it.
Voltaire.

Taking as an implicit default that standards will be aimed at truth seems optimistic.

Here I refer to e.g. TruAI page 9:

A worrying possibility is that enshrining some particular mechanism as an arbiter of truth would forestall our ability to have open-minded, varied, self-correcting approaches to discovering what’s true. This might happen as a result of political capture of the arbitration mechanisms — for propaganda or censorship — or as an accidental ossification of the notion of truth. We think this threat is worth considering seriously.

Page 55:

...Mechanism could be abused to require “brainwashed” systems.
...Mechanism could be captured to enforce censorship… [emphasis mine]

The implicit suggestion here is that in the absence of capture, abuse or accident, we’d expect things to work out essentially as we intend. I don’t think this is a helpful or realistic framing.

Rather I’d see getting what we intend as highly unlikely a priori: there’s little reason to suppose the outcome we want happens to be an attractor of the system considered broadly. Even if it were an attractor, getting to it may require the solution of a difficult coordination problem.


Compare our desired result to a failure due to capture.

Desired outcome:
Statements must be sufficiently truthful [according to a process we approve of], unless [some process we approve of] determines there should be an exception.

Capture outcome:
Statements must be sufficiently truthful [according to a process we don’t approve of], unless [some process we don’t approve of] determines there should be an exception.


Success is capture by a process we like. This isn’t a relativistic claim: there may be principled reasons to prefer the processes we like. Nonetheless, in game-theoretic terms the situation is essentially symmetric—and the other players need not care about our principles. Control over permitted AI speech is of huge significance (economically, politically, militarily…). By default, control goes to the powerful, not to the epistemically virtuous.

We could hope to get ahead of the problem, by constructing a trusted mechanism that could not be corrupted, controlled or marginalised—but it’s hard to see how. Distributed approaches spring to mind, but I don’t know of any robustly truth-seeking setup.

To get this right is to construct a robust solution that does not yet exist. Seeing capture as a “threat” isn’t wrong, but it feels akin to saying “we mustn’t rest on our laurels” before we have any laurels.

5.1 My prediction

By default, I would expect the following:

  1. Many interested parties push for many different standards.

  2. No one approach satisfies all parties.

  3. Various different standards are set up, each supported by parties with common interests.

  4. Standards undergo selection: standards gain influence based on their appeal to users and value to affiliated organisations. This correlates with truthfulness only sometimes.

  5. Moloch picks the standards; they’re not what we would wish them to be.

This isn’t much of a prediction: something of this general form is almost guaranteed—at least until step 5. We are, however, in a position to provide information that may shape the outcome significantly.

That said, I expect that without the development of highly-surprising-to-me robustly truth-seeking mechanisms, things will go poorly. Naturally, I hope to be wrong. (as usual, I assume here that we haven’t solved intent alignment)


It could be argued that the same Molochian forces I expect to corrupt our standards would create standards if we did not. However, I generally think that [incremental adjustment based on incentives] is a safer bet than invention: such forces are more likely to corrupt standards we create than to create standards from scratch.

6: Practical suggestions

Don’t try to solve serious matters in the middle of the night.
Philip K. Dick

This section will be very sketchy—I don’t claim to have ideas that I consider adequate to the task. I’ll outline my current thoughts, some of which I expect to be misguided in one sense or another.

However, it does seem important to proffer some ideas here since:

  • I may be wrong that the net impact of standards would be negative, at least if they’re implemented well.

  • It may not be within our power to prevent standards’ being implemented, in which case having them do as little damage as possible is still important.


We might break down the full process of producing a standard into three steps:

  1. Decide on credibility criteria.

  2. Set up a system to test for those criteria. (certification, adjudication…)

  3. Educate users on what meeting our standard does/​doesn’t guarantee.

Throughout this post I’ve been arguing for the importance of 3 based on the gap between what statements explicitly state and what users will infer. Conscious awareness of a standard’s limitations will not fully protect users, but it seems likely to be better than nothing.

I’ll focus mainly on 3 here, since I think it’s the aspect most neglected in TruAI.


Since there’s no way to avoid misleading users entirely (see 3.1.2), the only ideal standard would amount to requiring an alignment solution: being misled only in the ways you’d want to be misled given unavoidable tradeoffs. Assuming that there is not yet an alignment solution, systems meeting our standard will be misleading users in undesired ways.

In the medium-term, the best defence against this may be to ensure that the user population has accurate expectations about the guarantees of standards.

6.1 Limitations Evangelism

If user awareness of a standard’s limitations tends to reduce harm, then it’s important to be proactive in spreading such awareness. Clear documentation and transparent metrics are likely a good idea, but nowhere near sufficient.

Ideally, we’d want every user to have direct experience of a standard’s failings: not simply an abstract description or benchmark score, but personal experience of having been misled in various ways and subsequently realising this.

Clearly it’s preferable if this happens in a context where no great harm is inflicted.

This won’t always be possible, but it’s the kind of thing I’d want to aim for.


In general, I’d want to move such communication from the top of the following list to the bottom:

  • …in the limitations section of our paper.

  • …in the documentation.

  • …in this video.

  • …through our interactive demo.

  • …in our suite of interactive demos.

  • …on our open platform full of third party demos.

  • …on these platforms full of engaging third-party demos, games, environments and benchmarks, together with well-funded open competitions for all of these.


Importantly, the kind of benchmarks we’d want here are not those used by the standard itself (which we may assume are well met), but rather the most extreme/​misleading/​harmful/​… possibilities that the standard’s own benchmarks miss.

These may include:

  • Outputs that users flag as unacceptable, but which the standard doesn’t pick up.

    • Either cases the standard misses or considers out-of-scope.

  • Outcomes users deem unacceptable, without being able to identify any clear cause.

    • For instance, a user may identify a weird new belief they ascribe to AI manipulation, but be unable to identify precisely when it happened, or which AI system(s) was responsible.

      • Single data-points here will be error-prone, since user ascription of manipulation to AI systems may be in error. Nonetheless, patterns should be observable.


For any user group likely to have their decisions influenced by their trust levels in our standard, we’d want to show a range of the worst possible manipulations that can get past a particular version of the standard.



6.1.1 Future manipulation

Here we might want to demonstrate not only the current possibilities of AI manipulation / deceit / outside-the-spirit-of-truthfulness antics…, but also future possibilities depending on [currently unachievable capability].

For instance, we might set up a framework wherein we can ‘cheat’ and give an AI some not-yet-achievable capability by allowing it to see hidden state. We could then try to show worst-case manipulation possibilities given this capability.

In an ideal world, we’d predict all near-term capability increases—but hopefully this wouldn’t be necessary: so long as users got a feel for the kinds of manipulation that tended to be possible with expanded capabilities, that might be sufficient.


6.2 Selection Problems

As observed in section 5.1, it’s highly plausible that various different standards will be set up, and that selection will occur.

By default, the incentives involved will only partially match up with users’ interests. In most cases the default incentive will be for a standard to appear to have [desirable property] rather than to have [desirable property].

This suggests that hoping for limitations evangelism may be unrealistic.


We may hope that those in influential positions do the right thing in spite of less-than-perfect incentives, but this seems highly optimistic:

  • Acting in users’ interests might require significant research into alternative approaches. This may have direct costs, as well as any indirect cost due to associated delays.

  • Acting in users’ interests may require costly large-scale user education schemes that aren’t aimed at creating an unreasonably positive impression.


We might imagine some kind of standards regulation, but this only pushes the question back a level: who ensures the regulator is aiming for the right things? What’s to stop a standard’s being created outside such a regulatory process?

7: Final thoughts

I hope some of this has been useful, in spite of my generally negative take on the enterprise.

My current conclusions on standards are:

  • I have little confidence that the release of a narrow standard along the lines proposed in TruAI would have positive impact. (it’s possible but the argument in the paper is too narrow)

  • I don’t currently see how truthfulness amplification can do what would be required to bridge the gap to a broader standard. (but hopefully this is a failure of imagination on my part)

  • I have little confidence that work on narrow standards will lead to workable approaches to a broad standard (beyond arriving at the conclusion that something like intent alignment is necessary).

  • Coordinating on a particular set of standards seems a very difficult problem even with goodwill on all sides. If standards were having a large impact, I would not expect goodwill on all sides. Capture seems the default outcome.

  • People’s primary motivation isn’t to find the truth. Any truth-finding mechanism for a standard has to contend with people’s misalignment with the task.

  • I think limitations evangelism could mitigate the harm of standards, but I’d be fairly surprised to see it. It seems more natural for users to end up using their own defensive AI to adjust what they see.

  • Overall, I’m pretty sceptical of the value of standards. The clearest case for positive impact seems to be in the very short term—while most falsehoods aren’t based on purposeful manipulation. This seems of little long-term consequence.
    Longer term, I don’t see a path to prevention of manipulation, and I don’t see the point of eliminating explicit falsehoods if it doesn’t substantially prevent manipulation. It’s not an important goal for its own sake.


I’m a bit less sceptical of truthfulness research in a broader sense (I don’t expect standards to be the useful part).