Yes, architecture matters. We don’t know how likely each architecture is to produce a rogue agent, but we do have subjective expectations about it, and it would be quite a coincidence if those probabilities were the same for every architecture. For example, if an architecture solves a given task easily, it needs less scaling up, and then I expect less opportunity for a mesa-optimizer to arise. Mixture of Experts is riskier if the chance that an expert of a given size goes rogue is astronomically close to neither 0 nor 1; it is less risky if that chance grows too fast with size (see the toy calculation below). Of course, it’s a shaky assumption that splitting a large network into experts will keep them from jointly forming a rogue agent. My object-level arguments might be wrong, but our risk-mitigation strategies should not disregard the question of architecture entirely.
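To spell out the arithmetic behind the Mixture of Experts claim with a toy model of my own (assuming, unrealistically, that k experts of size s each go rogue independently with probability p(s), while a monolithic model of size ks goes rogue with probability p(ks)):

$$P_{\text{MoE}} \approx 1 - \bigl(1 - p(s)\bigr)^{k} \qquad\text{vs.}\qquad P_{\text{mono}} = p(ks).$$

If p(s) is astronomically close to neither 0 nor 1, the factor of k drives the left-hand side toward 1, so MoE looks riskier; if p grows steeply enough with size, p(s) can be so much smaller than p(ks) that it outweighs having k chances, and MoE looks safer. The independence assumption and the functional form of p are of course simplifications.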