Unlike EJT, I think this is totally fine as a discourse norm, and should not be considered a “mistake”. I also think the title “there are no coherence theorems” is hyperbolic and misleading, even though it is true for a specific silly definition of “coherence theorem”.
I’m not sure about this.
Most of the statements EJT quoted, standing in isolation, might well be “fine” in the sense that they are merely imprecise rather than grave violations of discourse norms. The problem, it seems to me, lies in what those imprecise statements do to the community’s understanding of this topic, and in the manner in which they are used, by those same important and influential people, to argue for conclusions that don’t seem to me to be locally valid.
If those statements are just understood as informal expressions of the ideas John Wentworth was getting at, namely that when we “assume some arguably-intuitively-reasonable properties of an agent’s decisions”, we can then “show that these imply that the agent’s decisions maximize some expected utility function”, then this is perfectly okay.
But if those statements are mistaken by the community to mean (and used as soldiers to argue in favor of the idea) that “powerful agents must be EU maximizers in a non-trivial sense” or even the logically weaker claim that “such entities would necessarily be exploitable if they don’t self-modify into an EU maximizer”, then we have a problem.
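To make the contrast concrete, the representation-style result I have in mind is the von Neumann–Morgenstern theorem. Stated informally (this is my own gloss, not wording from EJT, John, or Eliezer), it says roughly the following:

```latex
\textbf{vNM representation theorem (informal).}
Let $\succeq$ be a preference relation over lotteries $p, q, r$ on a finite set of outcomes.
If $\succeq$ satisfies
\emph{completeness} ($p \succeq q$ or $q \succeq p$),
\emph{transitivity} ($p \succeq q$ and $q \succeq r$ imply $p \succeq r$),
\emph{continuity} (if $p \succ q \succ r$, then $\alpha p + (1-\alpha)r \succ q \succ \beta p + (1-\beta)r$
for some $\alpha, \beta \in (0,1)$), and
\emph{independence} ($p \succeq q$ iff $\alpha p + (1-\alpha)r \succeq \alpha q + (1-\alpha)r$
for all $r$ and $\alpha \in (0,1]$),
then there exists a utility function $u$ on outcomes such that
\[
  p \succeq q \iff \mathbb{E}_{x \sim p}[u(x)] \ge \mathbb{E}_{x \sim q}[u(x)].
\]
```

The direction of the implication matters: the theorem tells us that an agent which already satisfies those axioms can be represented as an expected utility maximizer; it does not, by itself, tell us that powerful agents must satisfy the axioms, which is exactly the gap between the two readings above.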
Moreover, as I have written in another comment here, if the reason someone would think that “many superficially appealing solutions like corrigibility, moral uncertainty etc are in general contrary to the structure of things that are good at optimization” is their intuitions about what powerful cognition must be like, but the source of those intuitions was the set of coherence arguments being discussed in the question post, then learning that the coherence arguments do not extend as far as they were purported to should cause that person to rethink those intuitions and the conclusions they had previously reached on their basis, as they are now tainted by that confusion.

Now take the exact quote from Eliezer that I mentioned at the top of my post (bolding is my addition):
Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, “Oh, well, I’ll just build an agent that’s good at optimizing things but doesn’t use these explicit expected utilities that are the source of the problem!”
And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.
And I have tried to write that page once or twice (eg “coherent decisions imply consistent utilities”) but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they’d have to do because this is in fact a place where I have a particular talent.
Which category from the above do you think this most neatly fits into? The innocuous imprecise expression of a mathematical theorem, or an assertion of an unproven result that is likely to induce misconceptions in the minds of readers? It certainly induced me to believe something incorrect, and I doubt I was the only one.

What about the following?
I think that to contain the concept of Utility as it exists in me, you would have to do homework exercises I don’t know how to prescribe. Maybe one set of homework exercises like that would be showing you an agent, including a human, making some set of choices that allegedly couldn’t obey expected utility, and having you figure out how to pump money from that agent (or present it with money that it would pass up).
Like, just actually doing that a few dozen times.
Maybe it’s not helpful for me to say this? If you say it to Eliezer, he immediately goes, “Ah, yes, I could see how I would update that way after doing the homework, so I will save myself some time and effort and just make that update now without the homework”, but this kind of jumping-ahead-to-the-destination is something that seems to me to be… dramatically missing from many non-Eliezers. They insist on learning things the hard way and then act all surprised when they do. Oh my gosh, who would have thought that an AI breakthrough would suddenly make AI seem less than 100 years away the way it seemed yesterday? Oh my gosh, who would have thought that alignment would be difficult?
Utility can be seen as the origin of Probability within minds, even though Probability obeys its own, simpler coherence constraints.
What purpose does the assertion that Eliezer “save[s] [himself] some time and effort and just make[s] that update now without the homework” serve other than to signal to the audience that this is such a clearly correct worldview that if only they were as smart and experienced as Eliezer, they too would immediately understand that what he is saying is completely true? Is it really not a violation of “discourse norms” to use this type of rhetoric when your claims are incorrect as written?
What about “Sufficiently optimized agents appear coherent”? That one, from its very title, is directly and unambiguously asserting something that has not been proven, and in any case is probably not correct (in a non-trivial sense):
Again, we see a manifestation of a powerful family of theorems showing that agents which cannot be seen as corresponding to any coherent probabilities and consistent utility function will exhibit qualitatively destructive behavior, like paying someone a cent to throw a switch and then paying them another cent to throw it back.
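To be clear about what is and isn’t established here: the money-pump behavior Eliezer describes is straightforward to exhibit for one specific kind of incoherence, namely a strict preference cycle. Here is a toy sketch of that special case (the items, the preference cycle, and the one-cent trades are my own illustrative choices, not anything from the post being quoted):

```python
# Toy money pump against an agent with cyclic preferences (illustrative only).
# The agent strictly prefers A over B, B over C, and C over A, and will pay
# one cent to swap its current item for one it prefers.

PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # (preferred, dispreferred) pairs


def agent_accepts(offered: str, held: str) -> bool:
    """The agent trades (and pays a cent) iff it strictly prefers the offered item."""
    return (offered, held) in PREFERS


def run_pump(rounds: int) -> int:
    """Repeatedly offer the item the agent prefers to its current holding, charging one cent per trade."""
    held = "A"
    cents_extracted = 0
    preferred_to = {"A": "C", "B": "A", "C": "B"}  # held item -> item the agent prefers to it
    for _ in range(rounds):
        offer = preferred_to[held]
        if agent_accepts(offer, held):
            held = offer
            cents_extracted += 1
    return cents_extracted


if __name__ == "__main__":
    # Every three trades the agent holds exactly what it started with, one cent poorer per trade.
    print(run_pump(30))  # -> 30
```

But a demonstration like this covers only the cyclic case; it is not a proof of the quoted claim that any agent which cannot be assigned coherent probabilities and a consistent utility function will exhibit such behavior, and that stronger claim is precisely what EJT points out has not been established.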
I can keep going, but I think the general pattern is probably quite clear by now.
As it turns out, all of the worst examples are from Eliezer specifically, and the one example of yours that EJT quoted is basically entirely innocuous, so I want to make clear that I don’t think you specifically have done anything wrong or violated any norms.
I agree Eliezer’s writing often causes people to believe incorrect things and there are many aspects of his discourse that I wish he’d change, including some of the ones you highlight. I just want to push back on the specific critique of “there are no coherence theorems”.
(In fact, I made this post because I too previously believed incorrect things along these lines, and those incorrect beliefs were probably downstream of arguments made by Eliezer or MIRI, though it’s hard to say exactly what the influences were.)