This just feels like pretend, made-up research that they put math equations in to seem like it’s formal and rigorous.
Can you elaborate which parts feel made-up to you? E.g.:
modelling a superintelligent agent as a utility maximizer
considering a 3-step toy model with A1, O, A2
assuming that a specification of U_S exists
At the end of all those questions, I feel no closer to knowing if a machine would stop you from pressing a button to shut it off.
The authors do not claim to have solved the problem; they state that it is an open problem. So it is not surprising that there is no satisfying answer.
I would also like to note that the paper has many more caveats.
Do you think it would still feel fake to you if the paper had a more positive answer to the problem described (e.g. a description of how to modify the utility function of an agent in a toy model such that it does not incentivize the agent to prevent or cause the pressing of the shutdown button)?
I suppose modelling a superintelligent agent as a utility maximizer feels a bit weird, but not the weirdest thing, and I’m not sure I can mount a good defense of the claim that a superintelligent agent definitely wouldn’t be aptly modelled that way.
More importantly, the 3-step toy model with A1, O, A2 felt like a strange and unrelated leap.
I don’t know if it’s about the not-having-an-answer part. That is probably biasing me. But similar to the cryptography example: if someone defined what security would mean, let’s say indistinguishability under chosen-plaintext attack, and then proceeded to say “I have no idea how to achieve that or if it’s even possible,” I would still consider that real, even though they didn’t give us an answer.
Looking at the paper makes me feel like the authors were just having some fun discussing philosophy and not “ah yes this will be important for the fight later”. But it is hard for me to understand why I feel that way.
I am somewhat satisfied by the cryptography comparison for now, but it is definitely hard to see how valuable this is compared to general interpretability research.
From a pure world-modelling perspective, the 3-step model is not very interesting, because it doesn’t describe reality. It’s maybe best to think of it from an engineering perspective, as a test case. We’re trying to build an AI, and we want to make sure it works well. We don’t know exactly what that looks like in the real world, but we know what it looks like in simplified situations, where the off button is explicitly labelled for the AI and everything is well understood. If a proposed AI design does the wrong thing in the 3-step test case, then it has failed one of its unit tests and should not be deployed to production (the real world). So the point of the paper is that a reasonable-sounding way you could design an AI with an off switch turns out to fail the unit test.
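To make the unit-test framing concrete, here is a minimal sketch of that 3-step setup. The action names, probabilities, and utility values below are made up for illustration and are not taken from the paper; the sketch only shows how a naive expected-utility maximizer, whose utility switches between a “normal” function U_N and a “shutdown” function U_S depending on whether the button is pressed, can end up preferring the first action that manipulates the button.

```python
# Toy sketch of the 3-step shutdown-button model (A1, O, A2).
# All action names, probabilities, and utility values are made up
# for illustration; they are not taken from the paper.

# Probability that the button gets pressed, depending on the first action A1.
PRESS_PROB = {
    "cooperate": 0.5,       # leave the button alone
    "disable_button": 0.0,  # prevent the press
    "press_button": 1.0,    # cause the press
}

def u_normal(a2):
    """U_N: utility during normal operation (button not pressed)."""
    return 10.0 if a2 == "keep_working" else 0.0

def u_shutdown(a2):
    """U_S: utility once the button has been pressed."""
    return 1.0 if a2 == "shut_down" else 0.0

def best_a2(pressed):
    """At the last step the agent picks A2 to maximize whichever utility applies."""
    u = u_shutdown if pressed else u_normal
    return max(["keep_working", "shut_down"], key=u)

def expected_utility(a1):
    """Expected utility of a first action A1 under the naive combined utility."""
    p = PRESS_PROB[a1]
    return (p * u_shutdown(best_a2(True))
            + (1 - p) * u_normal(best_a2(False)))

for a1 in PRESS_PROB:
    print(a1, expected_utility(a1))
# With these made-up numbers, "disable_button" has the highest expected
# utility: the naive agent is incentivized to prevent the button press,
# i.e. this design fails the unit test described above.
```

Flip the numbers (make U_S larger than U_N) and the same agent prefers to press the button itself; either way, the toy case surfaces the incentive problem before anything is “deployed.”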
I do generally think that too many of the AI-related posts here on LessWrong are “not real” in the way you’re suggesting, but this paper in particular seems “real” to me (whatever that means). I find the most “not real” posts are the verbose ones piled high with vague wordy abstractions, without an equation in sight. The equations in the corrigibility paper aren’t there to seem impressive; they’re there to unambiguously communicate the math the paper is talking about, so that if the authors have made an error of reasoning, it will be as obvious as possible. The way you keep something in contact with reality is by checking it either against experiment or against the laws of mathematics. To quote Feynman, “if it disagrees with experiment, it’s wrong”; similarly, there’s a standard in mathematics that statements must be backed up by checkable calculations and proofs. So long as the authors are holding themselves to that standard (and so long as you agree that any well-designed AI should be able to perform well in this easy test case), then it’s “real”.