Hi Paul, thanks. It was nice reading this reply, and I like your points here.
Some of what I say here might reveal a lack of keeping up well with ARC. But as someone who works primarily on interpretability, the idea of mechanistic anomaly detection techniques that are not useful in today’s vision or language models seems surprising to me. Is there anything you can point me to that would help me understand why an interpretability/anomaly detection tool that’s useful for ASI or something might not be useful for cars?
We are mostly thinking about interpretability and anomaly detection designed to resolve two problems (see here):
Maybe the AI thinks about the world in a wildly different way than humans and translates into human concepts by asking “what would a human say?” instead of “what is actually true?” This leads to bad generalization when we consider cases where the AI system plans to achieve a goal and has the option of permanently fooling humans. But that problem is very unlikely to be serious for self-driving cars, because we can acquire ground truth data for the relevant queries. On top of that, it just doesn’t seem that self-driving cars will think about physical reality in such an alien way.
Maybe an AI system explicitly understands that it is being evaluated, and then behaves differently if it later comes to believe that it is free to act arbitrarily in the world.
We do hope that these are just special cases and that our methods will resolve a broader set of problems, but these two special cases loom really large. On the other hand, it’s much less clear whether realistic failures for self-driving cars will involve the kind of mechanism distinctions we are trying to detect.
Also: there are just way more pressing problems for self-driving cars. And on top of all that, we are taking a very theoretical approach precisely because we are worried it may be difficult to study these problems until a future time very close to when they become catastrophic.
Overall I think that if someone looked at our research agenda and viewed it as an attempt to respond to reliability failures in existing models, the correct reaction should be more like “Why are they doing all of this instead of just empirically investigating which failures are most important for self-driving cars and then thinking about how to address those?” There’s still a case for doing more fundamental theoretical research even if you are interested in more prosaic reliability failures, but (i) it’s qualitatively much weaker and I don’t really believe it, and (ii) this isn’t what such research should look like. So I think it’s pretty bad if someone is evaluating us from that perspective.
(In contrast I think this is a more plausible framing for e.g. work on adversarial evaluation and training. It might still lead an evaluator astray, but at least it’s a very plausible research direction to focus on for prosaic reliability as well as being something we might want to apply to future systems.)
We do hope that these are just special cases and that our methods will resolve a broader set of problems.
I hope so too. And I would expect this to be the case for good solutions.
Whether they are based on mechanistic interpretability, probing, other interpretability tools, adversaries, relaxed adversaries, or red-teaming, I would expect methods that are good at detecting goal misgeneralization or deceptive alignment to also be useful for self-driving cars and other issues. At the end of the day, any misaligned model will have a bug—some set of environments or inputs that will make it do bad things. So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.
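To make this concrete, here is a toy sketch (my own illustration on made-up Gaussian “activations”, not any published method) of a generic activation-space anomaly detector: fit a Gaussian to a model’s hidden activations on trusted inputs, then flag inputs whose activations land unusually far from that distribution.

```python
import numpy as np

def fit_detector(trusted_acts):
    """Fit a Gaussian to hidden activations collected on trusted inputs."""
    mu = trusted_acts.mean(axis=0)
    cov = np.cov(trusted_acts, rowvar=False)
    # Small ridge term keeps the covariance invertible with few samples.
    prec = np.linalg.inv(cov + 1e-3 * np.eye(cov.shape[0]))
    return mu, prec

def anomaly_score(acts, mu, prec):
    """Squared Mahalanobis distance from the trusted activation distribution."""
    d = acts - mu
    return np.einsum("ij,jk,ik->i", d, prec, d)

rng = np.random.default_rng(0)
trusted = rng.normal(0.0, 1.0, size=(500, 8))  # stand-in for activations on ordinary inputs
mu, prec = fit_detector(trusted)

normal = rng.normal(0.0, 1.0, size=(100, 8))
weird = rng.normal(4.0, 1.0, size=(100, 8))    # stand-in for mechanistically unusual inputs
scores_normal = anomaly_score(normal, mu, prec)
scores_weird = anomaly_score(weird, mu, prec)
```

Nothing in the detector is specific to deception: it only sees that the activations are atypical, so the same scores would flag a rare, risky road scene just as readily as an unusual internal mechanism. That is the sense in which I’d expect such tools to transfer.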
So I’m inclined to underline the key point of my original post. I want to emphasize the value of (1) engaging more with the rest of the community, those who don’t identify as “AI Safety” researchers, and (2) being clear that we care about alignment for all of the right reasons. That said, this should be discussed with the appropriate amount of clarity, which was your original point.
So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.
One example: a tool designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without), doesn’t seem like it would be useful for improving the reliability of self-driving cars. Self-driving cars likely aren’t misaligned in the sense that they could drive perfectly safely but choose not to; rather, they are simply unable to drive perfectly safely because some of their internal (learned) systems aren’t sufficiently robust.
Thanks for the comment. I’m inclined to disagree though. The application was for detecting deceptiveness. But the method was a form of contrastive probing. And one could use a similar approach for significantly different applications.
This feels kind of like a semantic disagreement to me. To ground it, it’s probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cars (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness, possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I’m uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.
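To illustrate why I call the method generic contrastive probing, here is a minimal sketch of a CCS-style objective on synthetic data (a toy linear probe of my own construction, not the paper’s implementation). The probe never sees truth labels; it is trained only on the unsupervised consistency and confidence losses, and the “truthfulness” application comes entirely from what the contrast pairs represent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 200, 16
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy "hidden states": a latent truth direction plus noise. x_pos plays the
# role of the representation of "statement + Yes", x_neg of "statement + No".
truth_dir = rng.normal(size=dim)
labels = rng.integers(0, 2, size=n)                         # used only for evaluation
x_pos = np.outer(2 * labels - 1, truth_dir) + 0.1 * rng.normal(size=(n, dim))
x_neg = -x_pos + 0.1 * rng.normal(size=(n, dim))

w = 0.01 * rng.normal(size=dim)                             # linear probe, trained without labels
lr = 0.1
for _ in range(500):
    p_pos, p_neg = sigmoid(x_pos @ w), sigmoid(x_neg @ w)
    g_pos, g_neg = p_pos * (1 - p_pos), p_neg * (1 - p_neg)
    # Consistency term: (p_pos + p_neg - 1)^2, the two answers should be complementary.
    e = p_pos + p_neg - 1
    grad = (2 * e * g_pos) @ x_pos + (2 * e * g_neg) @ x_neg
    # Confidence term: min(p_pos, p_neg)^2, rules out the degenerate p = 0.5 probe.
    use_pos = p_pos < p_neg
    m = np.where(use_pos, p_pos, p_neg)
    gm = np.where(use_pos, g_pos, g_neg)
    xm = np.where(use_pos[:, None], x_pos, x_neg)
    grad += (2 * m * gm) @ xm
    w -= lr * grad / n

pred = sigmoid(x_pos @ w) > 0.5
acc = max(np.mean(pred == labels), np.mean(pred != labels))  # probe sign is arbitrary
```

On data this clean the recovered direction should end up (anti-)aligned with `truth_dir`, so the probe separates the pairs despite never seeing a label. Note that nothing above mentions truthfulness: swapping in contrast pairs for a different binary property reuses the whole pipeline, which is the "generically useful" half of the distinction.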
I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I’m all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed.
Not Paul, but some possibilities why ARC’s work wouldn’t be relevant for self-driving cars:
The stuff Paul said about them aiming at understanding quite simple human values (don’t kill us all, maintain our decision-making power) rather than subtle things. It’s likely for self-driving cars we’re more concerned with high reliability and hence would need to be quite specific. E.g., maybe ARC’s approach could discern whether a car understands whether it’s driving on the road or not (seems like a fairly simple concept), but not whether it’s driving in a riskier way than humans in specific scenarios.
One of the problems that I think ARC is worried about is ontology identification, which seems like a meaningfully different problem for sub-human systems (whose ontologies are worse than ours, so in theory could be injected into ours) than for human-level or super-human systems (where that may not hold). Hence focusing on the super-human case would look weird and possibly not helpful for the subhuman case, although it would be great if they could solve all the cases in full generality.
Maybe once it works ARC’s approach could inform empirical work which helps with self-driving cars, but if you were focused on actually doing the thing for cars you’d just aim directly at that, whereas ARC’s approach would be a very roundabout and needlessly complex and theoretical way of solving the problem (this may or may not actually be the case, maybe solving this for self-driving cars is actually fundamentally difficult in the same way as for ASI, but it seems less likely).