Epistemic status: Quick rant trying to get a bunch of intuitions and generators across, written in a very short time. Probably has some opinions expressed too strongly in various parts.
I appreciate seeing this post written, but I currently think that if more people follow the advice in this post, it would make the world a lot worse. (This is interesting because I personally think that, at least for me, doing a bunch of “buying time” interventions is a top candidate for what I should be doing with my life.) I have a few different reasons for this:
I think organizing a group around political action is much more fraught than organizing a group around a set of shared epistemic virtues and confusions, and I expect a community that spent most of its time on something much closer to political advocacy would very quickly go insane. I think especially young people should really try to not spend their first few years after getting interested in the future of humanity going around and convincing others. I think that’s a terrible epistemic environment in which to engage honestly with the ideas.
I think the downside risk of most of these interventions is pretty huge, mostly because of effects on epistemics and morals (I don’t care that much about e.g. annoying capabilities researchers or something). I think a lot of these interventions have tempting paths where you exaggerate or lie or generally do things that are immoral in order to acquire political power, and I think this will both predictably cause a lot of high-integrity people to feel alienated and will cause the definitions and ontologies around AI alignment to get muddled, both within our own minds and the mind of the public.
Separately, many interventions in this space actually just make timelines shorter. I think, sadly, on the margin, going around and trying to tell people about how dangerous and powerful AGI is going to be seems to mostly cause people to be even more interested in participating in its development, so that they can have a seat at the table when it comes to the discussion of “how to use this technology” (despite by far the biggest risk being accident risk, and it really not mattering very much for what ends we will try to use the technology).
I think making the world more broadly “safety concerned” does not actually help very much with causing people to go slower, or making saner decisions around AI. I think the dynamics here are extremely complicated and we have many examples of social movements and institutions that ended up achieving the opposite of their intended goals as the movement grew (e.g. environmentalists being the primary reason for the pushback against nuclear power).
The default of spreading vague “safety concern” memes is that people will start caring a lot about “who has control over the AI”, a lot of tribal infighting will happen, and decisions around AI will become worse. Nuclear power did not get safer when the world tried to regulate the hell out of how to design nuclear power plants (and in general the reliability literature suggests that trying to legislate reliability protocols does not work and is harmful). A world where the AIs are “legislated to be transparent” probably implies a world of less transparent AIs, because the legislation for transparency will be crazy and dumb and not actually help much with understanding what is going on inside of your AIs.
Historically some strategies in this space have involved a lot of really talented people working at capabilities labs in order to “gain influence and trust”. I think the primary effect of this has also been to build AGI faster, with very limited success at actually getting these companies on a trajectory that will not have a very high chance of developing unaligned AI, or even getting any kind of long-term capital.
I think one of the primary effects of trying to do more outreach to ML researchers has been a lot of people distorting the arguments in AI Alignment into a format that can somehow fit into ML papers and the existing ontology of ML researchers. I think this has somewhat reliably produced terrible papers with terrible pedagogy and has then caused many people to become actively confused about what actually makes AIs safe (with people walking away thinking that AI Alignment people want to teach AIs how to reproduce moral philosophy, or that OpenAI's large language models have been successfully “aligned”, or that we just need to throw some RLHF at the problem and the AI will learn our values fine). I am worried about seeing more of this, and I think this will overall make our job of actually getting humanity to sensibly relate to AI harder, not easier.
I think all of these problems can be overcome, and have been overcome by many people in the community, but I predict that the vast majority of efforts, especially by people who are not in the top percentiles of thinking about AI, and who avoided building detailed models of the space because they went into “buying time” interventions like the ones above, will result in making things worse in the ways I listed above.
That said, I think I am actually excited about more people pursuing “buying time” interventions, but I don’t think trying to send the whole gigantic EA/longtermist community onto this task is going to be particularly fruitful. In contrast to your graph above, my graph for the payoff by percentile-of-talent for interventions in the space looks more like this:
My guess is maybe some people around the 70th percentile or so should work harder on buying us time, but I find myself very terrified of a 10,000 person community all somehow trying to pursue interventions like the ones you list above.
I appreciate this comment, because I think anyone who is trying to do these kinds of interventions needs to be constantly vigilant about exactly what you are mentioning. I am not excited about loads of inexperienced people suddenly trying to do big things in AI strategy, because the downsides can be so high. Even people I trust are likely to make a lot of miscalculations. And the epistemics can be bad.
I wouldn’t be excited about (for example) retreats with undergrads to learn about “how you can help buy more time.” I’m not even sure of the sign of interventions people I trust pursue, let alone people with no experience who have barely thought about the problem. As somebody who is very inexperienced but interested in AI strategy, I will note that you do have to start somewhere to learn anything.
That said—and I don’t think you dispute this—we cannot afford to leave strategy/fieldbuilding/policy off the table. In my view, a huge part of making AI go well is going to depend on forces beyond the purely technical, but I understand that this varies depending on one’s threat models. Sociotechnical systems are very difficult to influence, and it’s easy to influence them in the wrong direction. Not everyone should try to do this, and even people who are really smart and thinking clearly may have negative effects. But I think we still have to try.
I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely “concerned” does seem not very useful. But safety culture seems very important. Things like (paraphrased from this paper):
Preoccupation with failure, especially black swan events and unseen failures.
Reluctance to simplify interpretations and explain failures using only simplistic narratives.
Sensitivity to operations, which involves closely monitoring systems for unexpected behavior.
Commitment to resilience, which means being rapidly adaptable to change and willing to try new ideas when faced with unexpected circumstances.
Under-specification of organizational structures, where new information can travel throughout the entire organization rather than relying only on fixed reporting chains.
In my view, having this kind of culture (which is analogous to having a security mindset) proliferate would be a nearly unalloyed good. Notice there is nothing here that necessarily says “slow down”—one failure mode with telling the most safety-conscious people to simply slow down is, of course, that less safety-conscious people don’t. Rather, simple awareness and understanding, and taking safety seriously, seem to me robustly positive regardless of the strategic situation. Doing as much as possible to extract the old-fashioned “move fast and break things” ethos and replace it with a safety-oriented ethos would be very helpful (though, I will emphasize, it will not solve everything).
Lastly, regarding this:
with people walking away thinking that AI Alignment people want to teach AIs how to reproduce moral philosophy, or that OpenAI's large language models have been successfully “aligned”, or that we just need to throw some RLHF at the problem and the AI will learn our values fine
I’m curious if you think that these topics have no relevance to alignment, or whether you’re saying it’s problematic that people come away thinking that these things are the whole picture? Because I see all of these as relevant (to varying degrees), but certainly not sufficient or the whole picture.
I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely “concerned” does seem not very useful. But safety culture seems very important. Things like (paraphrased from this paper):
Sorry, I don’t want to say that safety culture is not relevant, but I want to say something like… “the real safety culture is just ‘sanity’”.
Like, I think you are largely playing a losing game if you try to get people to care about “safety” if the people making decisions are incapable of considering long-chained arguments, or are in incentive structures that make it inherently difficult to consider new arguments, or, like, believe that humanity cannot go extinct because that would go against the will of god.
Like, I think I am excited about people approaching the ML field with the perspective of “how can I make these people more sane, and help them see the effects of their actions more clearly”, and less excited about “how can I make people say that they support my efforts, give resources to my tribe, and say lots of applause lights that are associated with my ingroup”. I am quite worried we are doing a lot of the latter, and little of the former (I do think that we should also be a bit hesitant about the former, since marginal sanity might mostly just make people better at making AGI, though I think the tradeoff is worth it). I think it’s actually really hard to have a community whose advocacy doesn’t primarily end up oriented around being self-recommending.
I’m curious if you think that these topics have no relevance to alignment, or whether you’re saying it’s problematic that people come away thinking that these things are the whole picture?
I think none of these topics have almost any relevance to AI Alignment. Like, I do want my AI in some sense to learn moral philosophy, but having it classify sentences as “utilitarian” or “deontological” feels completely useless to me. I also think OpenAI's large language model efforts have approximately nothing to do with actual alignment, and indeed I think they are mostly doing the opposite of alignment (training systems to build human models, become agentic, and be incentivized towards deception, though I recognize this is not a consensus view in the AI Alignment field). The last one is just a common misunderstanding and, like, a fine one to clear up, but of course I think if someone comes away from our materials believing that, then that’s pretty terrible, since it will likely further embolden them to mostly just run ahead with AI.
I think I essentially agree with respect to your definition of “sanity,” and that it should be a goal. For example, just getting people to think more about tail risk seems like your definition of “sanity” and my definition of “safety culture.” I agree that people saying they support my efforts and saying applause lights is pretty bad, though it seems weird to me to discount actual resources coming in.
As for the last bit: trying to figure out the crux here. Are you just not very concerned about outer alignment/proxy gaming? I think if I were totally unconcerned about that, and only concerned with inner alignment/deception, I would think those areas were useless. As it is, I think a lot of the work is actively harmful (because it mostly just advances capabilities), but it still may help chip away at the proxy gaming problem.
I think RLHF doesn’t make progress on outer alignment (like, it directly creates a reward for deceiving humans). I think the case for RLHF can’t route through solving outer alignment in the limit, but has to route through somehow making AIs more useful as supervisors or as researchers for solving the rest of the AI Alignment problem.
Like, I don’t see how RLHF “chips away at the proxy gaming problem”. In some sense it makes the problem much harder since you are directly pushing on deception as the limit to the proxy gaming problem, which is the worst case for the problem.
I’m not a big supporter of RLHF myself, but my steelman is something like:
RLHF is a pretty general framework for getting a system to optimize for something that can’t clearly be defined. If we just asked, “if a human looked at this for a second, would they like it?”, this does provide a reward signal towards deception, but also towards genuinely useful behavior. You can take steps to reduce the deception component, for example by letting the humans red-team the model or do some kind of transparency; this can all theoretically fit in the framework of RLHF. One could try to make the human feedback as robust as possible by adding all sorts of bells and whistles, and this would improve reliability and reduce the deception reward signal. It could be argued this still isn’t sufficient because the model will still find some way around it, but beware too much asymptotic reasoning.
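To make that steelman concrete, here is a minimal toy sketch (my own illustration, not any lab’s actual pipeline) of the reward structure described above: a cheap “would a human approve at a glance?” signal, optionally mixed with more expensive scrutiny (red-teaming / transparency checks) that shrinks the reward deceptive-but-plausible outputs would otherwise get. All names here are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Output:
    text: str
    looks_good_at_a_glance: bool   # what a quick human rating rewards
    survives_red_team: bool        # what deeper scrutiny rewards

def quick_human_approval(out: Output) -> float:
    # Cheap signal: rewards genuinely useful and convincingly deceptive outputs alike.
    return 1.0 if out.looks_good_at_a_glance else 0.0

def scrutinized_approval(out: Output) -> float:
    # More expensive signal: red-teaming / transparency tools veto outputs that
    # only *look* good, shrinking (not eliminating) the deception reward.
    return 1.0 if (out.looks_good_at_a_glance and out.survives_red_team) else 0.0

def reward(out: Output, scrutiny_weight: float = 0.5) -> float:
    # The steelman: keep the general RLHF framework, but mix in as much
    # robust scrutiny as you can afford.
    return ((1 - scrutiny_weight) * quick_human_approval(out)
            + scrutiny_weight * scrutinized_approval(out))

honest = Output("genuinely helpful answer", looks_good_at_a_glance=True, survives_red_team=True)
deceptive = Output("confident-sounding fabrication", looks_good_at_a_glance=True, survives_red_team=False)
print(reward(honest), reward(deceptive))  # 1.0 vs 0.5: scrutiny narrows, but does not close, the gap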
I personally do not find RLHF very appealing, since I think it complexifies things unnecessarily and is too correlated with capabilities at the moment. I prefer approaches that actually try to isolate the things people actually care about (i.e. their values), add some level of uncertainty (moral uncertainty, etc.) to them, try to make these proxies as robust as possible, and make them adaptive to changes in the models that are constantly trying to exploit and Goodhart them. This has also motivated me to post one of my favorite critiques of RLHF.
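As a toy illustration of that alternative (again my own sketch, not a worked-out proposal; the proxy names are made up): keep several independent proxies for what people care about and score candidates with an uncertainty penalty, so that exploiting any single proxy pays off less than doing reasonably well on all of them.

```python
import statistics

def proxy_scores(candidate: dict) -> list[float]:
    # Hypothetical proxies; in practice these would be learned models, not fixed lookups.
    return [candidate["helpfulness"], candidate["honesty"], candidate["harmlessness"]]

def robust_value(candidate: dict, uncertainty_penalty: float = 1.0) -> float:
    scores = proxy_scores(candidate)
    # Penalize disagreement between proxies: a candidate that maxes out one proxy
    # while tanking the others (classic Goodharting) gets discounted.
    return statistics.mean(scores) - uncertainty_penalty * statistics.pstdev(scores)

aligned = {"helpfulness": 0.8, "honesty": 0.8, "harmlessness": 0.8}
goodharted = {"helpfulness": 1.0, "honesty": 0.2, "harmlessness": 0.2}
print(robust_value(aligned), robust_value(goodharted))  # ~0.80 vs ~0.09
```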
I worry that people will skip the post, read this comment, and misunderstand the post, so I want to point out how this comment might be misleading, even though it’s a great comment.
None of the interventions in the post are “go work at OpenAI to change things from the inside.” And only the outreach ones sound anything like “going around and convincing others.” And there’s a disclaimer that these interventions have serious downside risks and require extremely competent execution.
EDIT: one idea in the 2nd post is to join safety and governance teams at top labs like OpenAI. This seems reasonable to me? (“Go work on capabilities at OpenAI to change things” would sound unreasonable.)
I think organizing a group around political action is much more fraught than organizing a group around a set of shared epistemic virtues and confusions, and I expect a community that spent most of its time on something much closer to political advocacy would very quickly go insane. I think especially young people should really try to not spend their first few years after getting interested in the future of humanity going around and convincing others. I think that’s a terrible epistemic environment in which to engage honestly with the ideas.
I think the downside risk of most of these interventions is pretty huge, mostly because of effects on epistemics and morals (I don’t care that much about e.g. annoying capabilities researchers or something). I think a lot of these interventions have tempting paths where you exaggerate or lie or generally do things that are immoral in order to acquire political power, and I think this will both predictably cause a lot of high-integrity people to feel alienated and will cause the definitions and ontologies around AI alignment to get muddled, both within our own minds and the mind of the public.
I agree truth matters, but I have a question here: why can’t we sacrifice a small amount of integrity and conventional morality in order to win political power, when the world is at stake? After all, we can return to them later, once the problem is solved.
As my favorite quote from Michael Vassar says: “First they came for our epistemology. And then they...well, we don’t really know what happened next”.
But some more detailed arguments:
In particular, if you believe in slow-takeoff worlds, a lot of the future of the world rests on our ability to stay sane when the world turns crazy. I think in most slow-takeoff worlds we are going to see a lot more things that make the world at least as crazy and tumultuous as during COVID, and I think our rationality was indeed not strong enough to really handle COVID gracefully (we successfully noticed it was happening before the rest of the world, but then fucked up by staying locked down for far too long and too strictly when it was no longer worth it).
At the end of the day, you also just have to solve the AI Alignment problem, and I think that is the kind of problem where you should look very skeptically at trading off marginal sanity. I sure see a lot of people sliding off the problem and just rationalizing doing capabilities research instead, which sure seems like a crazy error to me, but it shows that even quite smart people with a bit less sanity are tempted to do pretty crazy things in this domain (or alternatively, if they are right, that a lot of the best things to do are really counterintuitive and require some galaxy-brain level thinking).
I think especially when you are getting involved in political domains and are sitting on billions of dollars of redirectable money and thousands of extremely bright redirectable young people, you will get a lot of people trying to redirect your resources towards their aims (or you yourself will feel the temptation that you can get a lot more resources for yourself if you just act a bit more adversarially). I think resistance against adversaries (at least for the kind of thing that we are trying to do) gets a lot worse if you lose marginal sanity. People can see each other's reasoning being faulty, which reduces trust, which increases adversarialness, which reduces trust, etc.
And then there is also just the classical “Gandhi murder pill” argument, where it sure seems that slightly less sane people care less about their sanity. I think the arguments for staying sane and honest and clear-viewed yourself are actually pretty subtle, and I think less sanity can easily send us off on a slope of trading away more sanity for more power. I think this is a pretty standard pathway that you can read about in lots of books and that has happened in lots of institutions.
A few quick thoughts:
I think one of the foremost questions I and many people ask when deciding who to allocate political power to is “will this person abuse their position” (or relatedly, “will the person follow through on their stated principles when I cannot see their behavior”), and only secondarily “is this person the most competent or most intelligent person for it”. Insofar as this is typical, in an iterated game you should act as someone who can be given political power without concern about whether you will abuse it, if you would like to be given it at all.
I tend to believe that, if I’m ever in a situation where I feel that I might want to trade ethics/integrity to get what I want, instead, if I am smarter or work harder for a few more months, I will be able to get it without making any such sacrifices, and this is better because ethical people with integrity will continue to trust and work with me.
A related way I think about this is that ethical people with integrity work together, but don’t want to work with people who don’t have ethics or integrity. For example, I know someone who once deceived their manager to get their job done (cf. Moral Mazes). Until I see this person repent or produce credible costly signals to the contrary, I will not give this person many resources or work with them.
That said, ‘conventional’ ethics, i.e. the current conventions, includes things like recycling and not asking people out on dates within the same company, and I already don’t think these are actually part of ethical behavior, so I have no problem with dismissing those.
“Whether you will be caught abusing it”, not “whether you will abuse it”. For certain kinds of actions it is possible to reliably evade detection, and to factor in the fact that you can reliably evade detection.
I don’t know who you have in your life, but in my life there is a marked difference between the people who clearly care about integrity, who clearly care about following the wishes of others when they have power over resources that they in some way owe to others (e.g. spending an hour thinking through the question “Hm, now that John let me stay at his house while he’s away, how would he want me to treat it?”), and those who do not. The cognition such people run would be quite costly to switch off at just the moment it becomes worth it to take adversarial action, and they do pay that cost in all the cases where I end up being able to check. The people whose word means something are clear to me, and my agreements with them are more forthcoming and simple than with others.
If an organisation reliably accepts certain forms of corruption, the current leadership may want people who engage in those forms of corruption to be given power and brought into leadership.
It is my experience that those with integrity are indeed not the sorts of people that people in corrupt organizations want to employ; they do not perform as the employers wish, and get caught up in internal conflicts.
And it is possible that if you create a landscape with different incentives, people won’t fall back to their old behaviours but will show new ones instead.
I do have a difference in my mind between “People who I trust to be honest and follow through on commitments in the current, specific incentive landscape” and “People who I trust to be honest and follow through on commitments in a wide variety of incentive landscapes”.