So a convincing argument is trying to establish a universal quantifier over potential solutions to a possible problem.
This seems hard to do, and like something most people may not have much experience with (especially since the problems are only defined informally at this point). Can you link to some existing arguments of this kind, either against this AI alignment approach (ones that previously caused you to change your vision), or on other topics, to give a sense of what kinds of techniques might be helpful for establishing such a universal quantifier?
For example, should one try to define the problem formally and then mathematically prove that no solution exists? But then how does one show that there is no alternative formal definition of the problem (one that still captures the essence of the informal problem) for which a solution does exist?
Some examples that come to mind:
This comment of yours changed my thinking about security amplification by cutting off some lines of argument, and it forced me to lower my overall goals (though it is simple enough that it feels like it should have been clear in advance). I believe the scheme survives overall, as I discussed at the workshop, but in a slightly different form.
This post by Jessica both does a good job of surveying some concerns and makes a novel argument (if the importance weight is slightly wrong then you totally lose; see the toy sketch after this list) that leaves me very skeptical of any importance-weighting approach to fixing Solomonoff induction, which in turn leaves me more skeptical of “direct” approaches to benign induction.
In this post I listed implicit ensembling as an approach to robustness. Between Jessica’s construction described here and discussions with MIRI folk, who argued persuasively that the number of extra bits needed to get honesty is large enough that even a good KWIK bound would be mediocre (partially described by Jessica here; see the back-of-envelope after this list), I ended up pessimistic.
None of these posts use heavy machinery.
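To make the importance-weighting fragility concrete, here is a toy numerical sketch (illustrative numbers of my choosing, not Jessica’s actual construction): a weight meant to exactly cancel a malign hypothesis’ per-bit edge, but miscalibrated by a tiny factor, compounds until the malign hypothesis dominates the posterior.

```python
import math

# Two hypotheses predict a binary stream; the "malign" one fits each
# bit slightly better. An importance weight is supposed to cancel that
# edge exactly, but the weight actually applied is off by `err`.
p_benign = 0.50                  # benign hypothesis' per-bit probability of the data
p_malign = 0.51                  # malign hypothesis fits each bit a hair better
intended = p_benign / p_malign   # per-bit weight that would exactly cancel the edge
err = 0.001                      # the applied weight is off by 0.1%

# Per-bit drift in the log odds (malign : benign) under the bad weight;
# algebraically this reduces to log(1 + err), which grows linearly in n.
step = math.log(p_malign / p_benign) + math.log(intended * (1 + err))

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} bits: malign/benign posterior odds ~ {math.exp(n * step):.3g}")
```

After 100,000 bits the posterior odds favor the malign hypothesis by a factor of roughly 10^43, which is the sense in which “slightly wrong” becomes “totally lose.”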
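The KWIK point is similar back-of-envelope arithmetic (the base bound and bit counts below are hypothetical, chosen only to show the scaling): KWIK-style guarantees typically scale with the size of the hypothesis class, so k extra description bits inflate the class, and hence the bound, by a factor of 2^k.

```python
# Hypothetical numbers: a KWIK-style mistake/abstention bound that
# scales linearly with hypothesis-class size inflates by 2**k when
# pinning down honesty costs k extra description bits.
base_bound = 1_000           # assumed KWIK bound for the base class
for k in (10, 20, 30):       # extra bits needed to specify honesty
    print(f"k={k:>2} extra bits -> bound ~ {base_bound * 2**k:.2e}")
```

Even at k = 20 the bound lands in the billions, which is what “even a good KWIK bound would be mediocre” cashes out to.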
To clarify, when I say “trying to establish” I don’t mean “trying to establish in a rigorous way”; I just mean that the goal of the informal reasoning should be the informal conclusion “we won’t be able to find a way around this problem.” It’s also not a literal universal quantifier, in the same way that cryptography isn’t up against a literal universal quantifier, so I was doubly sloppy.
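For concreteness, the standard cryptographic security statement quantifies only over efficient adversaries, with a negligible error allowance, roughly:

```latex
% The cryptographic quantifier, for contrast: universal over
% probabilistic polynomial-time adversaries only, not all adversaries.
\[
\forall A \in \mathsf{PPT}\;\; \exists\, \text{negligible } \varepsilon:\quad
\Pr\bigl[A \text{ breaks the scheme at security parameter } n\bigr] \le \varepsilon(n).
\]
```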
I don’t think that a mathematical proof is likely to be convincing on its own (as you point out, there is a lot of slack in the choice of formalization). It might be helpful as part of an argument, though I doubt that’s going to be where the action is.