Does anyone—and in particular Evhub—have updated views on this post with the benefit of hindsight?
I intuitively don’t like this approach, but I have trouble articulating exactly why. I’ve tried to explain a bit in this comment, but I don’t think I’m quite saying the right thing.
One issue I have is that it doesn't seem to nicely handle interactions between the properties of the AI and how it's used. You can have an AI which is safe when used in some ways but not in others. This could be due to approaches like control (which mostly route around mechanistic properties of the AI), but also potentially things like using monitoring ensembles to handle lack of robustness, or paying AIs rather than aligning them.
Another problem I have is that this doesn't very naturally incorporate various non-mechanistic analyses targeting specific threat models, which IMO should be (and will be) very central. E.g., suppose we built a wide variety of model organisms which are closely analogous to our training and deployment environment and which aim to uncover potential reward hacking failure modes, and these model organisms didn't demonstrate any issues. Same for things like adversarially testing for clear misalignment: it doesn't result in a mechanistic model, but it feels very central.
To be clear, I think all the things I discussed above can be expressed in this framework, but it feels quite unnatural and the decomposition doesn't seem like it's doing any work.
I think the type of mechanistic analysis proposed here seems quite aspirational given the current state of technology, such that it feels odd to center it. Alternatively, the mechanistic analysis you do will apply to all training runs and no safety interventions will affect it, such that it's more like useful background than a key part of analyzing different safety measures. To be clear, we will want to do some mechanistic analysis and have some space of mechanistic hypotheses. But this feels more like the background threat model than the core safety case, due to difficulties in testing. We can also somewhat test these mechanistic hypotheses with experiments that don't require huge technological breakthroughs, but this seems more like an important sub-component of a safety case than the main thing.
Perhaps Evan thinks we're totally screwed (or at least can't obtain high confidence) without strong mechanistic analysis, such that centering this is good. I think the high-confidence question is unclear, and I disagree that we're totally screwed. It's possible that my views here partially come down to a difference of opinion with Evan, where he thinks that deceptive alignment is very likely given usage of models capable of powerful goal-oriented behavior whereas I think this is uncertain. Further, I think it's reasonably likely (perhaps 1/3) that I'll end up being very confident that deceptive alignment is very unlikely at the point when we have powerful AIs (due to experiments and further conceptual reasoning).
More generally, I feel like the way I currently talk and think about safety cases and similar topics doesn't fit nicely into training stories. I think the way I currently do it is better, but I'm not entirely certain and I haven't tried the training stories approach much.
I should also note that a general approach like training stories seems much better than a decomposition like "inner alignment" vs "outer alignment", which presupposes a particular approach to solving the problem. (I do think that "inner misalignment" vs "outer misalignment" is a reasonable decomposition of threat models for AIs produced with ML. But these are threat models, not problems to be solved, and there are many routes to solving them. See here for more discussion.)
I think I prefer the default trajectory of safety cases and RSPs to what would happen with additional emphasis on training stories, but I'm uncertain.