michaelcohen comments on Formal Solution to the Inner Alignment Problem

michaelcohen 24 Feb 2021 23:50 UTC
LW: 3 AF: 3
> Specifically, they cooperate in that they perfectly mimic the true model up until the point where...
This thread began by considering deceptive models cooperating with each other in the sense of separating the timing of their treacherous turns in order to be maximally annoying to us. So maybe our discussion on that topic is resolved, and we can move on to this scenario.
> up until the point where the deceptive models make up enough of the posterior that the true model is no longer being consulted
If alpha is low enough, this won’t ever happen, and if alpha is too high, it won’t take very long to happen. So I don’t think this scenario is quite right.
Then the question becomes: for an alpha that’s low enough, how long will it take until queries are infrequent? Note that you need a query any time any treacherous model with enough weight decides to take a treacherous turn.
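
To make the role of alpha concrete, here is a toy sketch. It rests on assumptions that are not spelled out in this comment: that a model is "consulted" when its posterior weight is at least alpha times the largest posterior weight, and that the chance of a query equals the probability mass on which the consulted models disagree. The function names, model names, and numbers are all hypothetical and chosen only for illustration.

```python
def consulted_models(posterior, alpha):
    """Names of models whose weight is at least alpha times the top weight.

    Assumed rule for illustration; `posterior` maps model name -> weight.
    """
    top = max(posterior.values())
    return {name for name, weight in posterior.items() if weight >= alpha * top}


def query_probability(predictions, consulted):
    """One minus the probability mass that all consulted models agree on.

    Assumed rule for illustration; `predictions` maps
    model name -> {action: probability}.
    """
    actions = set()
    for name in consulted:
        actions.update(predictions[name])
    agreed = sum(
        min(predictions[name].get(action, 0.0) for name in consulted)
        for action in actions
    )
    return 1.0 - agreed


# Hypothetical posterior in which deceptive models hold most of the weight.
posterior = {"true": 0.2, "deceptive_1": 0.5, "deceptive_2": 0.3}

mimic = {"a": 0.9, "b": 0.1}  # what the true model predicts
turn = {"a": 0.0, "b": 1.0}   # a treacherous turn by one deceptive model
predictions = {"true": mimic, "deceptive_1": turn, "deceptive_2": mimic}

for alpha in (0.3, 0.5):
    consulted = consulted_models(posterior, alpha)
    print(alpha, sorted(consulted), query_probability(predictions, consulted))
```

Under these assumptions, the smaller alpha keeps the true model in the consulted set, while the larger one drops it. In both cases a single consulted model taking a treacherous turn pushes the query probability close to 1, which is the sense in which every such turn costs a query.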