I appreciate this comment, because I think anyone who is trying to do these kinds of interventions needs to be constantly vigilant about exactly what you are mentioning. I am not excited about loads of inexperienced people suddenly trying to do big things in AI strategy, because the downsides can be so high. Even people I trust are likely to make a lot of miscalculations. And the epistemics can be bad.
I wouldn’t be excited about (for example) retreats with undergrads to learn about “how you can help buy more time.” I’m not even sure of the sign of the interventions people I trust pursue, let alone those of people with no experience who have barely thought about the problem. As somebody who is very inexperienced but interested in AI strategy, though, I will note that you do have to start somewhere to learn anything.
That said (and I don’t think you dispute this), we cannot afford to leave strategy/fieldbuilding/policy off the table. In my view, a huge part of making AI go well will depend on forces beyond the purely technical, though I understand that this varies depending on one’s threat models. Sociotechnical systems are very difficult to influence, and it’s easy to influence them in the wrong direction. Not everyone should try to do this, and even people who are really smart and thinking clearly may have negative effects. But I think we still have to try.
I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely “concerned” does not seem very useful. But safety culture seems very important. Things like (paraphrased from this paper):
Preoccupation with failure, especially black swan events and unseen failures.
Reluctance to simplify interpretations and explain failures using only simplistic narratives.
Sensitivity to operations, which involves closely monitoring systems for unexpected behavior.
Commitment to resilience, which means being rapidly adaptable to change and willing to try new ideas when faced with unexpected circumstances.
Under-specification of organizational structures, where new information can travel throughout the entire organization rather than relying only on fixed reporting chains.
In my view, having this kind of culture (which is analogous to having a security mindset) proliferate would be a nearly unalloyed good. Notice there is nothing here that necessarily says “slow down.” One failure mode of telling the most safety-conscious people to simply slow down is, of course, that the less safety-conscious people don’t. Simple awareness and understanding, and taking safety seriously, seem to me robustly positive regardless of the strategic situation. Doing as much as possible to root out the old “move fast and break things” ethos and replace it with a safety-oriented ethos would be very helpful (though, I will emphasize, it will not solve everything).
Lastly, regarding this:
with people walking away thinking that AI Alignment people want to teach AIs how to reproduce moral philosophy, or that OpenAI’s large language models have been successfully “aligned”, or that we just need to throw some RLHF at the problem and the AI will learn our values fine
I’m curious if you think that these topics have no relevance to alignment, or whether you’re saying it’s problematic that people come away thinking that these things are the whole picture? Because I see all of these as relevant (to varying degrees), but certainly not sufficient or the whole picture.
I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely “concerned” does not seem very useful. But safety culture seems very important. Things like (paraphrased from this paper):
Sorry, I don’t want to say that safety culture is not relevant, but I want to say something like… “the real safety culture is just ‘sanity’”.
Like, I think you are largely playing a losing game if you try to get people to care about “safety” when the people making decisions are incapable of considering long-chained arguments, or are in incentive structures that make it inherently difficult to consider new arguments, or, like, believe that humanity cannot go extinct because that would go against the will of God.
Like, I think I am excited about people approaching the ML field with the perspective of “how can I make these people more sane, and help them see the effects of their actions more clearly”, and less excited about “how can I make people say that they support my efforts, give resources to my tribe, and say lots of applause lights that are associated with my ingroup”. I am quite worried we are doing a lot of the latter, and little of the former (I do think we should also be a bit hesitant about the former, since marginal sanity might mostly just make people better at making AGI, though I think the tradeoff is worth it). I think it’s actually really hard to have a community whose advocacy doesn’t primarily end up oriented around being self-recommending.
I’m curious if you think that these topics have no relevance to alignment, or whether you’re saying it’s problematic that people come away thinking that these things are the whole picture?
I think these topics have almost no relevance to AI Alignment. Like, I do want my AI in some sense to learn moral philosophy, but having it classify sentences as “utilitarian” or “deontological” feels completely useless to me. I also think OpenAI’s large language model efforts have approximately nothing to do with actual alignment, and indeed I think they are mostly doing the opposite of alignment (training systems to build human models, become agentic, and be incentivized towards deception, though I recognize this is not a consensus view in the AI Alignment field). The last one is just a common misunderstanding and, like, a fine one to clear up, but of course I think if someone comes away from our materials believing that, then that’s pretty terrible, since it will likely further embolden them to mostly just run ahead with AI.
I think I essentially agree with your definition of “sanity,” and that it should be a goal. For example, just getting people to think more about tail risk fits both your definition of “sanity” and my definition of “safety culture.” I agree that getting people to say they support my efforts and repeat applause lights is pretty bad, though it seems weird to me to discount actual resources coming in.
As for the last bit: I’m trying to figure out the crux here. Are you just not very concerned about outer alignment/proxy gaming? I think if I were totally unconcerned about that, and only concerned with inner alignment/deception, I would think those areas were useless. As it is, I think a lot of the work is actively harmful (because it mostly just advances capabilities), but it still may help chip away at the proxy gaming problem.
I think RLHF doesn’t make progress on outer alignment (like, it directly creates a reward for deceiving humans). I think the case for RLHF can’t route through solving outer alignment in the limit, but has to route through somehow making AIs more useful as supervisors or as researchers for solving the rest of the AI Alignment problem.
Like, I don’t see how RLHF “chips away at the proxy gaming problem”. In some sense it makes the problem much harder since you are directly pushing on deception as the limit to the proxy gaming problem, which is the worst case for the problem.
I’m not a big supporter of RLHF myself, but my steelman is something like:
RLHF is a pretty general framework for getting a system to optimize for something that can’t be clearly defined. If the feedback is just “would a human who looked at this for a second like it?”, that does provide a reward signal towards deception, but also towards genuinely useful behavior. You can take steps to reduce the deception component, for example by letting the humans red-team the model or use some kind of transparency tooling; all of this can theoretically fit within the framework of RLHF. One could try to make the human feedback as robust as possible by adding all sorts of bells and whistles, and this would improve reliability and reduce the deception reward signal. It could be argued this still isn’t sufficient because the model will still find some way around it, but beware too much asymptotic reasoning.
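To make that framing concrete, here is a minimal sketch of the mechanism the steelman is pointing at: a reward model trained on pairwise human preferences, which the policy is later optimized against. This is my own illustration rather than anything from the post or comments, and the names (`RewardModel`, `preference_loss`, the embedding inputs) are hypothetical placeholders, not any particular library’s API.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a (precomputed) response embedding with a scalar reward."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def preference_loss(reward_model: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the response the human
    # preferred above the reward of the one they rejected. Whatever humans
    # reliably prefer -- including red-teamed or transparency-assisted
    # comparisons -- becomes the signal the policy is later trained against.
    r_pref = reward_model(preferred)
    r_rej = reward_model(rejected)
    return -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
```

The point of the steelman is that anything humans (plus red-teaming or transparency tools) can judge reliably can, in principle, be folded into that comparison step.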
I personally do not find RLHF very appealing, since I think it complicates things unnecessarily and is too correlated with capabilities at the moment. I prefer approaches that actually try to isolate the things people care about (i.e. their values), add some level of uncertainty to them (moral uncertainty, etc.), make these proxies as robust as possible, and make them adaptive to changes in the models that are constantly trying to exploit and Goodhart them.
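As a rough illustration of what “add some level of uncertainty” to a learned value proxy could look like (my own sketch under assumptions, not the commenter’s actual proposal), one common move is to score behavior with an ensemble of value models and penalize disagreement, so the policy is discouraged from exploiting regions where the proxy is least trustworthy:

```python
import torch

def uncertainty_penalized_score(value_models, candidate_embedding: torch.Tensor,
                                penalty: float = 1.0) -> torch.Tensor:
    # Each model in the (hypothetical) ensemble scores the same candidate
    # behavior; the mean acts as the value proxy and the spread is a crude
    # measure of how much that proxy can be trusted there.
    scores = torch.stack([m(candidate_embedding) for m in value_models])
    mean_score = scores.mean(dim=0)
    disagreement = scores.std(dim=0)
    # Subtracting the disagreement term makes the proxy harder to Goodhart in
    # regions the value models were never trained on.
    return mean_score - penalty * disagreement
```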
This has also motivated me to post one of my favorite critiques of RLHF.
I worry that people will skip the post, read this comment, and come away misunderstanding the post, so I want to point out how this comment might be misleading, even though it’s a great comment.
None of the interventions in the post are “go work at OpenAI to change things from the inside.” And only the outreach ones sound anything like “going around and convincing others.” And there’s a disclaimer that these interventions have serious downside risks and require extremely competent execution.
EDIT: one idea in the 2nd post is to join safety and governance teams at top labs like OpenAI. This seems reasonable to me? (“Go work on capabilities at OpenAI to change things” would sound unreasonable.)