I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely “concerned” does not seem very useful. But safety culture seems very important. Things like (paraphrased from this paper):
Sorry, I don’t want to say that safety culture is not relevant, but I want to say something like… “the real safety culture is just ‘sanity’”.
Like, I think you are largely playing a losing game if you try to get people to care about “safety” when the people making decisions are incapable of considering long-chained arguments, or are in incentive structures that make it inherently difficult to consider new arguments, or, like, believe that humanity cannot go extinct because that would go against the will of god.
Like, I think I am excited about people approaching the ML field with the perspective of “how can I make these people more sane, and help them see the effects of their actions more clearly”, and less excited about “how can I make people say that they support my efforts, give resources to my tribe, and say lots of applause lights that are associated with my ingroup”. I am quite worried we are doing a lot of the latter, and little of the former (I do think that we should also be a bit hesitant about the former, since marginal sanity might mostly just make people better at making AGI, though I think the tradeoff is worth it). I think it’s actually really hard to have a community whose advocacy doesn’t primarily end up oriented around being self-recommending.
I’m curious if you think that these topics have no relevance to alignment, or whether you’re saying it’s problematic that people come away thinking that these things are the whole picture?
I think these topics have almost no relevance to AI Alignment. Like, I do want my AI in some sense to learn moral philosophy, but having it classify sentences as “utilitarian” or “deontological” feels completely useless to me. I also think OpenAI’s large language model efforts have approximately nothing to do with actual alignment, and indeed I think they are mostly doing the opposite of alignment (training systems to build human models, become agentic, and be incentivized towards deception, though I recognize this is not a consensus view in the AI Alignment field). The last one is just a common misunderstanding and, like, a fine one to clear up, but of course I think if someone comes away from our materials believing that, then that’s pretty terrible, since it will likely further embolden them to mostly just run ahead with AI.
I think I essentially agree with respect to your definition of “sanity,” and that it should be a goal. For example, just getting people to think more about tail risk seems like your definition of “sanity” and my definition of “safety culture.” I agree that just getting people to say that they support my efforts and recite applause lights is pretty bad, though it seems weird to me to discount actual resources coming in.
As for the last bit: trying to figure out the crux here. Are you just not very concerned about outer alignment/proxy gaming? I think if I were totally unconcerned about that, and only concerned with inner alignment/deception, I would think those areas were useless. As it is, I think a lot of the work is actively harmful (because it mostly just advances capabilities), but it still may help chip away at the proxy gaming problem.
I think RLHF doesn’t make progress on outer alignment (like, it directly creates a reward for deceiving humans). I think the case for RLHF can’t route through solving outer alignment in the limit, but has to route through somehow making AIs more useful as supervisors or as researchers for solving the rest of the AI Alignment problem.
Like, I don’t see how RLHF “chips away at the proxy gaming problem”. In some sense it makes the problem much harder since you are directly pushing on deception as the limit to the proxy gaming problem, which is the worst case for the problem.
I’m not a big supporter of RLHF myself, but my steelman is something like:
RLHF is a pretty general framework for conforming a system to optimize for something that can’t clearly be defined. If we just did, “if a human looked at this for a second would they like it?” this does provide a reward signal towards deception, but also towards genuinely useful behavior. You can take steps to reduce the deception component, for example by letting the humans red team the model or do some kind of transparency; this can all theoretically fit in the framework of RLHF. One could try to make the human feedback as robust as possible by adding all sorts of bells and whistles, and this would improve reliability and reduce the deception reward signal. It could be argued this still isn’t sufficient because the model will still find some way around it, but beware too much asymptotic reasoning.
I personally do not find RLHF very appealing, since I think it complexifies things unnecessarily and is too correlated with capabilities at the moment. I prefer approaches that actually try to isolate the things people actually care about (i.e. their values), add some level of uncertainty (moral uncertainty, etc.) to them, try to make these proxies as robust as possible, and make them adaptive to changes in the models that are constantly trying to exploit and Goodhart them.
This has also motivated me to post one of my favorite critiques of RLHF.