TurnTrout comments on A shot at the diamond-alignment problem

TurnTrout 7 Oct 2022 18:04 UTC
LW: 3 AF: 2
if every shard has a veto over plans, and the shards are individually quite intelligent subagents
I think this won’t happen FWIW.
and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)
Can you provide a concrete instantiation of this argument? (ETA: struck this part, want to hear your response first to make sure it’s engaging with what you had in mind)
I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly “predicts”
1. What part of your argument works differently for humans than for AI? The dynamics you describe, as I understand your argument, are clearly not how shards work in people.
2. We aren’t in the prediction regime, insofar as that is supposed to be relevant for your argument. Let’s talk about the batch update, and not make analogies to predictions. (Perhaps I was the one who originally brought it up in the OP, though; if so, I should rewrite that.)
3. Can you give me a concrete example of an “exploiting shard” in this situation which is learnable early on, relative to the actual diamond-shards?
And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they’ll have de-facto control over the agent’s behavior.
The point I am arguing (ETA: and I expect Quintin is as well, but maybe not) is that this will be one of the primary shards produced, not that there’s a chance it exists at low weight or something.