This just feels like pretend, made-up research that they put math equations in to seem like it’s formal and rigorous.
Can you elaborate which parts feel made-up to you? E.g.:
modelling a superintelligent agent as a utility maximizer
considering a 3-step toy model with A1, O, A2
assuming that a specification of U_S exists
At the end of all those questions, I feel no closer to knowing if a machine would stop you from pressing a button to shut it off.
The authors do not claim to have solved the problem; they state that it is an open problem. So it is not surprising that there is no satisfying answer.
I would also like to note that the paper has many more caveats.
Do you think it would still feel fake to you if the paper had a more positive answer to the problem described (e.g. a description of how to modify the utility function of an agent in a toy model such that it does not incentivize the agent to prevent or cause the pressing of the shutdown button)?
I suppose modelling a superintelligent agent as a utility maximizer feels a bit weird, but not the weirdest thing, and I’m not sure I can mount a good defense of the claim that a superintelligent agent definitely wouldn’t be aptly modelled that way.
More importantly, the 3-step toy model with A1, O, A2 felt like a strange and unrelated leap.
I don’t know if it’s about the not-having-an-answer part. That is probably biasing me. But similar to the cryptography example: if someone defined what security would mean, let’s say indistinguishability under chosen-plaintext attack, and then proceeded to say “I have no idea how to achieve that or if it’s even possible,” I would still consider that real, even though they didn’t give us an answer.
Looking at the paper makes me feel like the authors were just having some fun discussing philosophy and not “ah yes this will be important for the fight later”. But it is hard for me to understand why I feel that way.
I am somewhat satisfied by the cryptography comparison for now, but it is definitely hard to see how valuable this is compared to general interpretability research.
From a pure world-modelling perspective, the 3-step model is not very interesting, because it doesn’t describe reality. It’s maybe best to think of it from an engineering perspective, as a test case. We’re trying to build an AI, and we want to make sure it works well. We don’t know exactly what that looks like in the real world, but we know what it looks like in simplified situations, where the off button is explicitly labelled for the AI and everything is well understood. If a proposed AI design does the wrong thing in the 3-step test case, then it has failed one of its unit tests and should not be deployed to production (the real world). So the point of the paper is that a reasonable-sounding way you could design an AI with an off switch turns out to fail the unit test.
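To make the unit-test framing concrete, here is a minimal sketch of that 3-step setup. The action names, probabilities, and utility values below are made up for illustration and are not taken from the paper; the sketch only shows how a naive expected-utility maximizer, whose utility switches between a “normal” function U_N and a “shutdown” function U_S depending on whether the button is pressed, can end up preferring the first action that manipulates the button.

```python
# Toy sketch of the 3-step shutdown-button model (A1, O, A2).
# All action names, probabilities, and utility values are made up
# for illustration; they are not taken from the paper.

# Probability that the button gets pressed, depending on the first action A1.
PRESS_PROB = {
    "cooperate": 0.5,       # leave the button alone
    "disable_button": 0.0,  # prevent the press
    "press_button": 1.0,    # cause the press
}

def u_normal(a2):
    """U_N: utility during normal operation (button not pressed)."""
    return 10.0 if a2 == "keep_working" else 0.0

def u_shutdown(a2):
    """U_S: utility once the button has been pressed."""
    return 1.0 if a2 == "shut_down" else 0.0

def best_a2(pressed):
    """At the last step the agent picks A2 to maximize whichever utility applies."""
    u = u_shutdown if pressed else u_normal
    return max(["keep_working", "shut_down"], key=u)

def expected_utility(a1):
    """Expected utility of a first action A1 under the naive combined utility."""
    p = PRESS_PROB[a1]
    return (p * u_shutdown(best_a2(True))
            + (1 - p) * u_normal(best_a2(False)))

for a1 in PRESS_PROB:
    print(a1, expected_utility(a1))
# With these made-up numbers, "disable_button" has the highest expected
# utility: the naive agent is incentivized to prevent the button press,
# i.e. this design fails the unit test described above.
```

Flip the numbers (make U_S larger than U_N) and the same agent prefers to press the button itself; either way, the toy case surfaces the incentive problem before anything is “deployed.”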
I do generally think that too many of the AI-related posts here on LessWrong are “not real” in the way you’re suggesting, but this paper in particular seems “real” to me (whatever that means). I find the most “not real” posts are the verbose ones piled high with vague wordy abstractions, without an equation in sight. The equations in the corrigibility paper aren’t there to seem impressive; they’re there to unambiguously communicate the math the paper is talking about, so that if the authors have made an error of reasoning, it will be as obvious as possible. The way you keep something in contact with reality is by checking it either against experiment or against the laws of mathematics. To quote Feynman, “if it disagrees with experiment, it’s wrong”; similarly, there’s a standard in mathematics that statements must be backed up by checkable calculations and proofs. So long as the authors are holding themselves to that standard (and so long as you agree that any well-designed AI should be able to perform well in this easy test case), then it’s “real”.