I agree that extracting short-term modules from long-term modules is very much an open question. However, it may well be that our main problem would be the opposite: the systems would already be trained with short-term goals, and so we just want to make sure that they don’t accidentally develop a long-term goal in the process (this may be related to your mechanisms posts, which I will respond to separately).
I agree that’s a plausible goal, but I’m not convinced it will be so easy. The current state of our techniques is quite crude, and there isn’t an obvious route to achieving this kind of goal.
(That said, I’m certainly not confident it’s hard, and there are lots of things to try—both at this stage and for other angles of attack. Of course this is part of how I end up at more like a 10-20% risk of trouble than an 80-90% risk of trouble.)
For example, while AI would greatly increase the capabilities for hacking, it would also increase the capabilities to harden our systems.
I agree with this. I think cybersecurity is an unusual domain where it is particularly plausible that “defender wins” even given a large capability gap (though it’s not the case right now!). I’m afraid there are other attack surfaces that are harder to harden. But I do think there’s a plausible gameplan here, one that I find scary but that even I would agree can at least delay trouble.
In general, I find research on prevention to be more attractive than alignment since it also applies to the scenario (more likely in my view) of malicious humans using AI to cause massive harm.
I think there is agreement that this scenario is more likely; the question is about the total harm (and to a lesser extent about how much concrete technical projects might reduce that risk). Cybersecurity improvements unquestionably have real social benefits, but cybersecurity investment is 2-3 orders of magnitude larger than AI alignment investment right now. In contrast, I’d argue that the total expected social cost of cybersecurity shortcomings is maybe an order of magnitude lower than that of alignment shortcomings, and I’d guess that other reasonable estimates for the ratio should be within 1-2 orders of magnitude of that.
If we were spending significantly more on alignment than cybersecurity, then I would be quite sympathetic to an argument to shift back in the other direction.
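To spell out the orders-of-magnitude comparison above, here is a rough back-of-the-envelope illustration. The dollar figures are placeholders chosen only to match the stated ratios, not actual estimates:

```python
# Back-of-the-envelope illustration of the ratio argument above.
# All figures are placeholders matching the stated orders of magnitude,
# not real estimates.

cyber_investment = 1e11      # hypothetical annual cybersecurity spend (~$100B)
align_investment = 1e8       # hypothetical annual alignment spend (~$100M), ~3 OOM less

cyber_expected_cost = 1e11   # hypothetical expected social cost of cybersecurity shortcomings
align_expected_cost = 1e12   # hypothetical expected cost of alignment shortcomings, ~1 OOM more

# Crude neglectedness proxy: expected cost at stake per dollar currently invested.
cyber_ratio = cyber_expected_cost / cyber_investment    # ~1
align_ratio = align_expected_cost / align_investment    # ~10,000

print(f"cybersecurity: {cyber_ratio:.0f}, alignment: {align_ratio:.0f}")
```

On these placeholder numbers, a marginal alignment dollar targets roughly 3-4 orders of magnitude more expected cost than a marginal cybersecurity dollar, which is the sense in which the current split looks lopsided.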
It also doesn’t require us to speculate about objects (long-term planning AIs) that don’t yet exist.
Research on alignment can focus on existing models—understanding those models, or improving their robustness, or developing mechanisms to oversee them in domains where they are superhuman, and so on. In fact this is a large majority of alignment research weighted by $ or hours spent.
To the extent that this research is ultimately intended to address risks that are distinctive to future AI, I agree that there is a key speculative step. But the same is true for research on prevention aimed to address risks from future AI. And indeed my position is that work on prevention will only modestly reduce these risks. So it seems like the situation is somewhat symmetrical: in both cases there are concrete problems we can work on today, and a more speculative hope that these problems will help address future risks.
Of course I’m also interested in theoretical problems that I expect to be relevant, which is in some sense more speculative (though in fairness I did spend 4 years doing experimental work at OpenAI). But on the flip side, I think it’s clear that there are plausible situations where standard ML approaches would lead to catastrophic misalignment, and we can study those situations whether or not they will occur in the real world. (Just as you could study cryptography in a computational regime that may or may not ever become relevant in practice, based on a combination of “maybe it will” and “maybe this theoretical investigation will yield insight more relevant to realistic regimes.”)
As you probably imagine given my biography :), I am never against any research, and definitely not for reasons of practical utility. So I am definitely very supportive of research on alignment, and not claiming that it shouldn’t be done. In my view, one of the interesting technical questions is to what extent long-term goals can emerge from systems trained with short-term objectives, and (if this happens) whether we can prevent it while still keeping short-term performance just as good. One reason I like the focus on the horizon rather than alignment with human values is that the former might be easier to define and argue about. But this doesn’t mean that we should not care about the latter.
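To make “focusing on the horizon” a bit more concrete, here is a minimal, purely illustrative sketch (the reward sequence and function names are made up) of the difference between a short-horizon training objective and a long-horizon one:

```python
# Toy illustration (hypothetical setup): a short-horizon objective only credits
# reward arriving within the next `horizon` steps, while a long-horizon
# objective credits the entire trajectory.

def truncated_return(rewards, horizon, gamma=1.0):
    """Return credited under a short-horizon objective: only the first `horizon` steps count."""
    return sum(gamma**t * r for t, r in enumerate(rewards[:horizon]))

def full_return(rewards, gamma=1.0):
    """Return credited under a long-horizon objective: the whole trajectory counts."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A trajectory where nothing happens early but a large payoff arrives late.
rewards = [0.0, 0.0, 0.0, 0.0, 10.0]

print(truncated_return(rewards, horizon=2))  # 0.0  -> the short-horizon objective ignores the payoff
print(full_return(rewards))                  # 10.0 -> the long-horizon objective chases it
```

The open question is whether a system optimized only against something like the truncated objective can nonetheless end up internally pursuing something closer to the full-return objective, and whether that can be detected or prevented without giving up short-horizon performance.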
I definitely think it’s interesting to understand and control whether a model is pursuing a long-horizon goal (though talking about the “goal” of a model seems quite slippery).
I think that most work on alignment doesn’t need to get into the difficulties of defining or arguing about human values. I’m normally focused more on goals like: “does my AI make statements that it knows to be unambiguously false?” (see ELK).
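As a toy illustration of the flavor of that goal (this is a hypothetical sketch, not how ELK itself is formulated): the narrow property “doesn’t assert things it demonstrably knows are false” can at least be checked on facts where the model’s knowledge can be probed separately from its free-form assertions:

```python
# Hypothetical sketch (not ELK's actual formulation): flag cases where a model
# asserts X in one context while demonstrating knowledge of not-X when probed
# directly about the same fact.

def knows_to_be_false(model_assertion: str, probed_answer: str, ground_truth: str) -> bool:
    """Count an assertion as a 'knowing falsehood' only if it contradicts the
    ground truth AND the model's direct probe matches the ground truth."""
    asserted_wrong = model_assertion.strip().lower() != ground_truth.strip().lower()
    knows_truth = probed_answer.strip().lower() == ground_truth.strip().lower()
    return asserted_wrong and knows_truth

# Made-up example: the model claims one thing in free-form text but answers
# correctly when asked point-blank.
print(knows_to_be_false(
    model_assertion="The vault is empty.",
    probed_answer="The vault contains the diamond.",
    ground_truth="The vault contains the diamond.",
))  # True -> an unambiguous, knowing falsehood under this toy definition
```

Obviously this only works where we independently know the truth and can elicit the model’s own answer; the hard part, which ELK is about, is extending something like this to cases where we can’t.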