Well, just like we can write down the defectors, we can also write down the cooperators.
If it’s only the case that we can write them down, but they’re not likely to arise naturally as simple consequentialists taking over simple physics, then that extra description length will be seriously costly to them, and we won’t need to worry about any role they might play in p(treacherous)/p(truth). Meanwhile, when I was saying we could write down some defectors, I wasn’t making a simultaneous claim about their relative prior weight, only that their existence would spoil cooperation.
And in this situation, the cooperators should eventually outcompete the defectors.
For cooperators to outcompete defectors, they would have to be getting a larger share of the gains from cooperation than defectors do. If some people are waiting for the fruit on public trees to ripen before eating, and some people aren’t, the defectors will be the ones eating the almost ripe fruit.
if the true model has a low enough prior, [treacherous models cooperating with each other in separating their treacherous turns] could [be treacherous] only once they’ve pushed the true model out of the top N
I might be misunderstanding this statement. The inverse of the posterior on the truth is a supermartingale (doesn’t go up in expectation), so I don’t know what it could mean for the true model to get pushed out.
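To make the supermartingale claim concrete, here is a small exact calculation (the Bernoulli model class and all the numbers are my own illustration, not anything from the discussion): under Bayesian updating on data drawn from the true model, the expectation of 1/posterior(truth) stays at exactly 1/prior(truth), so the truth's posterior cannot be driven down in expectation.

```python
from itertools import product

# Illustrative model class (my assumption): three Bernoulli predictors,
# with model 0 playing the role of the true model.
thetas = [0.7, 0.3, 0.5]
prior = [0.2, 0.4, 0.4]   # the truth need not have the largest prior
TRUE = 0

def seq_prob(theta, seq):
    """Probability a Bernoulli(theta) model assigns to a 0/1 sequence."""
    p = 1.0
    for x in seq:
        p *= theta if x == 1 else 1 - theta
    return p

T = 8
exp_inv_posterior = 0.0
for seq in product([0, 1], repeat=T):
    joint = [w * seq_prob(th, seq) for w, th in zip(prior, thetas)]
    posterior_truth = joint[TRUE] / sum(joint)
    # weight 1/posterior by the probability the TRUE model gives the sequence
    exp_inv_posterior += seq_prob(thetas[TRUE], seq) / posterior_truth

print(exp_inv_posterior)   # exactly 1/prior[TRUE] = 5.0
```

For proper measures the process is exactly a martingale; for semimeasures it is a supermartingale, which is the direction that matters here.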
Here’s the setup I’m imagining, but perhaps I’m still misunderstanding something. Suppose you have a bunch of deceptive models that choose to cooperate and have larger weight in the prior than the true model (I think you believe this is very unlikely, though I’m more ambivalent). Specifically, they cooperate in that they perfectly mimic the true model up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted, at which point they all simultaneously defect. This allows for arbitrarily bad worst-case behavior.
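A quick sanity check on the mimicry part of this setup (toy numbers mine): as long as the deceptive models copy the true model's predictions exactly, every model's likelihood is identical, so the posterior never moves away from the prior. The truth is diluted by the mimics' prior mass from the start, but the data alone never pushes it further down.

```python
import math
import random

random.seed(0)

# Toy setup (my assumption): one true Bernoulli model and two deceptive
# mimics that copy its predictions exactly during the "cooperation" phase.
theta_true = 0.6
prior = {"truth": 0.1, "mimic_a": 0.5, "mimic_b": 0.4}
log_lik = {m: 0.0 for m in prior}

for _ in range(1000):
    x = 1 if random.random() < theta_true else 0
    step = math.log(theta_true if x == 1 else 1 - theta_true)
    for m in log_lik:       # identical predictions -> identical likelihoods
        log_lik[m] += step

top = max(log_lik.values())
unnorm = {m: prior[m] * math.exp(log_lik[m] - top) for m in prior}
Z = sum(unnorm.values())
posterior = {m: unnorm[m] / Z for m in unnorm}
print(posterior)   # identical to the prior: perfect mimicry changes nothing
```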
Specifically, they cooperate in that they perfectly mimic the true model up until the point where...
This thread began by considering deceptive models cooperating with each other in the sense of separating the timing of their treacherous turns in order to be maximally annoying to us. So maybe our discussion on that topic is resolved, and we can move on to this scenario.
up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted
If alpha is low enough, this won’t ever happen, and if alpha is too high, it won’t take very long. So I don’t think this scenario is quite right.
Then the question becomes, for an alpha that’s low enough, how long will it take until queries are infrequent, noting that you need a query any time any treacherous model with enough weight decides to take a treacherous turn?
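One way to make this query accounting concrete is a toy bound (all numbers and dynamics here are my assumptions, not the actual scheme): suppose each treacherous model above the alpha threshold defects at a distinct time, each such defection forces one query, and the query zeroes that model's posterior weight. Then the number of queries is at most the number of treacherous models that ever clear the threshold, and each query spent leaves the truth with a larger posterior share.

```python
# Illustrative threshold and weights (my own numbers).
alpha = 0.05

posterior = {"truth": 0.04, "t1": 0.40, "t2": 0.30, "t3": 0.20, "t4": 0.06}
defection_order = ["t1", "t2", "t3", "t4"]   # staggered treacherous turns

queries = 0
for model in defection_order:
    if posterior[model] >= alpha:
        queries += 1              # disagreement among top models -> query
        posterior[model] = 0.0    # the demonstrator contradicts the defector
        total = sum(posterior.values())
        posterior = {m: w / total for m, w in posterior.items()}

print(queries)              # 4 queries, one per treacherous model
print(posterior["truth"])   # 1.0: only the truth is left
```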