Previously (see the Hanson/Eliezer FOOM debate) Eliezer thought you’d need recursive self-improvement first to get fast capability gain; now it looks like you can get fast capability gain without it, for meaningful levels of ‘fast’. This makes ‘hanging out’ at interesting levels of AGI capability at least possible, since capability gain wouldn’t automatically keep going right away.
Might be good to elaborate on this one a bit: why might that make ‘hanging out’ possible, i.e., diminishing returns? (Though if a substantial improvement can be made by (a) tweaks or (b) adding another technique or something, then maybe ‘hanging out’ won’t happen.)
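As a toy illustration of the diminishing-returns point (my own sketch; the growth factors below are made-up numbers, not anything from the debate or the post): if each round of self-improvement multiplies capability by a factor that shrinks toward 1, capability levels off and ‘hanging out’ is possible; if a tweak or an added technique keeps the factor above 1, growth keeps compounding instead.

```python
# Toy model (illustrative only): capability growth under two regimes of
# recursive self-improvement. All numbers are made up for the example.

def simulate(gain_at_step, steps=20, capability=1.0):
    """Apply a per-step multiplicative gain and return the trajectory."""
    trajectory = [capability]
    for t in range(steps):
        capability *= gain_at_step(t)
        trajectory.append(capability)
    return trajectory

# Regime A: diminishing returns -- each round of self-improvement helps less,
# so capability levels off and the system can 'hang out' at a plateau.
diminishing = simulate(lambda t: 1 + 0.5 / (t + 1) ** 2)

# Regime B: returns that don't diminish (say, a tweak or an extra technique
# keeps paying off), so capability keeps compounding instead of plateauing.
compounding = simulate(lambda t: 1.2)

print(f"diminishing returns, final capability: {diminishing[-1]:.2f}")
print(f"compounding returns, final capability: {compounding[-1]:.2f}")
```

The contrast is only meant to show why ‘hanging out’ depends on whether returns actually diminish, not to model real capability curves.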
Hiding what you are doing is a convergent instrumental strategy.
Amusingly, true on two levels. (Though there’s a worry that people won’t converge on that strategy anyway.)
Explanation of above part 2: Corrigibility is ‘anti-natural’ in a certain sense that makes it incredibly hard to, e.g., exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off.
The sort of ‘corrigibility’ at issue is ‘corrigibility without (something like) self-shutdown or self-destruct.’
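A minimal toy sketch of why that’s hard (my own illustration; the three actions and the payoff numbers are assumptions, not anything from the post): an expected-utility maximizer that values continuing over being shut down prefers manipulating the operator out of pressing the button over deferring to them, and patching in a ‘shutdown bonus’ just flips the incentive toward getting the button pressed.

```python
# Toy expected-utility calculation (illustrative only; the payoffs are assumed
# for the example). The agent picks the action with the highest expected
# utility, where an operator left alone presses the shutdown button with
# probability p_button.

def best_action(u_continue, u_shutdown, p_button=0.5):
    """Return the expected-utility-maximizing action and the full table."""
    expected = {
        # Defer: let the operator decide whether to shut the agent down.
        "defer_to_operator": p_button * u_shutdown + (1 - p_button) * u_continue,
        # Manipulate the operator so the button is never pressed.
        "prevent_shutdown": u_continue,
        # Manipulate the operator so the button is definitely pressed.
        "cause_shutdown": u_shutdown,
    }
    return max(expected, key=expected.get), expected

# If being shut down is worth less than continuing, preventing shutdown
# strictly beats deferring to the operator.
print(best_action(u_continue=10.0, u_shutdown=0.0))

# Patching in a big 'shutdown bonus' flips the problem: now the agent is
# incentivized to manipulate the operator into pressing the button.
print(best_action(u_continue=10.0, u_shutdown=15.0))

# Exact indifference avoids both incentives, but it is knife-edge: any
# perturbation tips the agent back into one of the two failure modes.
print(best_action(u_continue=10.0, u_shutdown=10.0))
```

That knife-edge quality, where deference only falls out of exact indifference rather than out of coherent preferences, is one way to gloss ‘anti-natural’.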
37. Trying to hardcode nonsensical assumptions or arbitrary rules into an AGI will fail because a sufficiently advanced AGI will notice that they are damage and route around them or fix them (paraphrase).
Is this about strategy/techniques, or reward?
39. Nothing we can do with a safe-by-default AI like GPT-3 would be powerful enough to save the world (to ‘commit a pivotal act’), although it might be fun. In order to use an AI to save the world, it needs to be powerful enough that you need to trust its alignment, which doesn’t solve your problem.
And it’s not particularly useful for convincing people to do things like ‘not publish’.
They’re a core thing one could and should ask an AI/AGI to build for you in order to accomplish what you want to accomplish.
(I expected a caveat here, like ‘if aligned’.)
43. Furthermore, an unaligned AGI powerful enough to commit pivotal acts should be assumed to be able to hack any human foolish enough to interact with it via a text channel.
What are pivotal acts, aside from ‘persuading people’? Nanotech, or is the bar lower?
Proving theorems about the AGI doesn’t seem practical.
Is proving things useful to ‘AI’? Like in Go, or StarCraft? Or are strategies never handled that way?