And if you block any one path to the insight that the earth is round, in a way that somehow fails to cripple it, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.
In case it’s of any interest, I’ll mention that when I “pump this intuition”, I find myself thinking it essentially impossible that we could ever build a general agent that didn’t notice that the world was round, and I’m unsure why (if I recall correctly) I sometimes read Nate or Eliezer write that they think it’s quite doable in principle, just much harder than the effort we’ll be giving it.
This perspective leaves me inclined to think that we ought to only build very narrow intelligences and give up on general ones, rather than attempt to build a fully general intelligence but with a bunch of reliably self-incorrecting beliefs about the existence or usefulness of deception (and/or other things).
(I say this in case Nate has a succinct and motivating explanation of why he thinks a solution does exist and is not actually impossibly difficult to find in theory, even if humans-on-earth may never find it.)
Couldn’t you just prompt a different model to modify all of the training data, both text and images, so that it becomes consistent with the earth being flat, or is there some reason that’s impossible?
The model wouldn’t be allowed to learn from user sessions (like gpt-n) or to generate answers and reflect on its own beliefs (as was used to fine-tune gpt-4).
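To make the proposal a bit more concrete, here’s a minimal sketch of what such a rewriting pass over the text data might look like. Everything in it is hypothetical: the prompt wording, the `rewrite_corpus` helper, and the stand-in `query_model` callable are placeholders for whatever rewriter model and API you would actually plug in.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical prompt for a separate "rewriter" model; the wording is illustrative only.
REWRITE_PROMPT = (
    "Rewrite the following passage so that it is consistent with the earth "
    "being flat, changing nothing else. If that cannot be done for this "
    "passage, reply with the single word IMPOSSIBLE.\n\n{passage}"
)

def rewrite_corpus(
    passages: Iterable[str],
    query_model: Callable[[str], str],
) -> Iterator[str]:
    """Ask a separate model to rewrite each training passage, dropping any
    passage it flags as impossible to reconcile with the target belief."""
    for passage in passages:
        rewritten = query_model(REWRITE_PROMPT.format(passage=passage))
        if rewritten.strip() != "IMPOSSIBLE":
            yield rewritten

# Toy stand-in "model" so the example runs end to end; a real pipeline would
# call an actual instruction-following model here.
def _toy_model(prompt: str) -> str:
    passage = prompt.split("\n\n", 1)[1]
    return "IMPOSSIBLE" if "curvature" in passage else passage

if __name__ == "__main__":
    corpus = [
        "Satellite photos show the curvature of the earth.",
        "The recipe calls for two cups of flour.",
    ]
    print(list(rewrite_corpus(corpus, _toy_model)))
    # -> ['The recipe calls for two cups of flour.']
```

Even granting a pass like this, the restriction in the second sentence is doing a lot of work: nothing downstream (learning from user sessions, fine-tuning on the model’s own reflections) can be allowed to let the consistent picture leak back in.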
Doable in principle, but such measures would necessarily cut into the potential capabilities of such a system.
So it’s basically a trade-off, and IMO very much worth it.
The problem is that we are not doing it, and, more basically, people generally do not get why it is important. Maybe it’s the framing, like when EY goes “superintelligence that firmly believes 222+222=555 without this leading to other consequences that would make it incoherent”.
I get exactly what he means, but I suspect that a lot of people are not able to decompress and unroll that into something they “grok” on a fundamental level.
Something like “a superintelligence that has no knowledge of itself and never reasons about itself, without this leading to other consequences that would make it incoherent” would cut out a ton of lethality. Combine that with giving such a thing zero agency in the world, and you might actually have something that could do “things we want, but don’t know how to do” without it ending us on the first critical try.
It’s probably not a good idea to feed an AI inconsistent data. For example, if the evidence shows that the Earth is round, but the AI is absolutely sure it isn’t, it will doubt any evidence of that, which could lead to a very weird world view.
But I think it’s possible to make an AI know about a fact while avoiding thinking about it.