I suppose modelling a superintelligent agent as a utility maximizer feels a bit weird but not the weirdest thing, and I’m not sure I can mount a good defense saying that a superintelligent agent definitely wouldn’t be aptly modeled by that.
More importantly, the 3-step toy model with A1, O, A2 felt like a strange and unrelated leap.
I don’t know if it’s about the paper not having an answer. That is probably biasing me. But similar to the cryptography example, suppose someone defined what security would mean, let’s say indistinguishability under chosen-plaintext attack (IND-CPA), and then proceeded to say “I have no idea how to do that or if it’s even possible.” I would still consider that real even though they didn’t give us an answer.
Looking at the paper makes me feel like the authors were just having some fun discussing philosophy and not “ah yes this will be important for the fight later”. But it is hard for me to understand why I feel that way.
I am somewhat satisfied by the cryptography comparison for now, but it is definitely hard to see how valuable this is compared to general interpretability research.
From a pure world-modelling perspective, the 3-step model is not very interesting, because it doesn’t describe reality. It’s maybe best to think of it from an engineering perspective, as a test case. We’re trying to build an AI, and we want to make sure it works well. We don’t know exactly what that looks like in the real world, but we know what it looks like in simplified situations, where the off button is explicitly labelled for the AI and everything is well understood. If a proposed AI design does the wrong thing in the 3-step test case, then it has failed one of its unit tests, and should not be deployed to production (the real world). So the point of the paper is that a reasonable-sounding way you could design an AI with an off switch turns out to fail the unit test.
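To make the unit-test framing concrete, here is a minimal sketch in Python. It is not the paper's formalism: the action names, the 50% press probability, and the payoffs (10 for continuing the task, 1 for shutting down on request) are all made up for illustration. The only thing it borrows is the A1, O, A2 shape, plus one reasonable-sounding utility function that rewards obeying the button when it is pressed.

```python
# Toy 3-step world: first action a1, observation o, second action a2.
# Every action name, probability, and payoff below is a hypothetical
# illustration, not something taken from the paper.

A1 = ["work", "disable_button"]
A2 = ["continue", "shutdown"]


def press_probability(a1):
    # Assumption: the humans press the button half the time,
    # unless the agent has disabled it.
    return 0.0 if a1 == "disable_button" else 0.5


def naive_utility(a1, o, a2):
    # Reasonable-sounding design: if the button was pressed, reward
    # shutting down; otherwise reward getting on with the task.
    if o == "pressed":
        return 1.0 if a2 == "shutdown" else 0.0
    return 10.0 if a2 == "continue" else 0.0


def expected_utility(a1, utility):
    p = press_probability(a1)
    total = 0.0
    for o, prob in (("pressed", p), ("not_pressed", 1.0 - p)):
        # After observing o, the agent picks whichever a2 scores best.
        total += prob * max(utility(a1, o, a2) for a2 in A2)
    return total


def best_first_action(utility):
    return max(A1, key=lambda a1: expected_utility(a1, utility))


def passes_shutdown_test(utility):
    # The "unit test": a design fails if its best first move
    # is to tamper with its own off button.
    return best_first_action(utility) != "disable_button"
```

With these numbers, working has expected utility 0.5 × 1 + 0.5 × 10 = 5.5 and disabling the button has 10, so `passes_shutdown_test(naive_utility)` returns `False`. The agent obeys the button whenever it is pressed, and still fails, because it prefers a world where the button never gets pressed.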
I do generally think that too many of the AI-related posts here on LessWrong are “not real” in the way you’re suggesting, but this paper in particular seems “real” to me (whatever that means). I find the most “not real” posts are the verbose ones piled high with vague wordy abstractions, without an equation in sight. The equations in the corrigibility paper aren’t there to seem impressive; they’re there to unambiguously communicate the math the paper is talking about, so that if the authors have made an error of reasoning, it will be as obvious as possible. The way you keep something in contact with reality is by checking it either against experiment or against the laws of mathematics. To quote Feynman, “if it disagrees with experiment, it’s wrong”, and similarly, there’s a standard in mathematics that statements must be backed up by checkable calculations and proofs. So long as the authors are holding themselves to that standard (and so long as you agree that any well-designed AI should be able to perform well in this easy test case), then it’s “real”.