I thought it was worth commenting here because, to me, the three-way debate between Eliezer Yudkowsky, Nora Belrose, and Andrew Critch managed to collectively touch on just about everything that I think the common debate gets wrong about AI “doom”, with the result that they’re all overconfident in their respective positions.
Starting with Eliezer and Nora’s argument. Her statement:
“Alien shoggoths are about as likely to arise in neural networks as Boltzmann brains are to emerge from a thermal equilibrium.”
To which Eliezer responds,
“How blind to ‘try imagining literally any internal mechanism that isn’t the exact thing you hope for’ do you have to be—to think that, if you erase a brain, and then train that brain solely to predict the next word spoken by nice people, it ends up nice internally?”
I agree that it’s a mistake to identify niceness with predicting nice behaviour, and I agree that Nora is overconfident that there will be no generalisation failures, as a result of making a similar mistake. If your model says misaligned generalisation is literally as unlikely as a Boltzmann brain appearing from nowhere, then something has gone wrong. But I don’t think her point is as straightforward as just conflating a nice internal mechanism with nice feedback. I’m going to try to explain what I think her argument is.
I think that Eliezer has an implicit model on which there are zillions of potential generalisations a model could learn that would predict nice behaviour, all of them pretty much equally likely to be learned a priori. Actually being nice is just one of them, so it’s basically impossible for RLHF to hit on it, and RLHF would therefore require tremendous cosmic coincidences to work.
Maybe this is true in some sense for arbitrarily superintelligent AI. But, as Paul Christiano said, I think this tells us little about what to expect from “somewhat superhuman” AI, which is what we care about for predicting whether we’ll see misalignment disasters in practice.
Rather, “actually learning to be nice” is how humans usually learn to predict nice behaviour. Of all the possible ways that generalisation from nice training could happen, this one is somewhat privileged as a hypothesis: it stands out from the background haze of random mechanisms that could be learned.
If the reasons this strategy worked for humans transfer to the LLM case (and that is highly arguable and unclear), then yes, it might be true that giving agents rewards for being nice causes them to internally develop a sort of pseudo-niceness representation that controls their behaviour and planning, even up to superhuman levels and even out of distribution. That wouldn’t be for ‘literally no reason’ or ‘by coincidence’ or ‘because of a map-territory conflation’, but because it’s possible that such a mechanism, in the form of a model inductive bias, really exists, and we have some vague evidence in favour of it.
Okay, so what’s the internal mechanism that I’m imagining which gets us there? Here’s a sketch, based on an “easy world” outlined in my alignment difficulty post.
Suppose that (up to some level of competence that’s notably superhuman for most engineering tasks), LLMs just search over potential writers of text, with RLHF selecting from the space of agents that have goals only over text completion. They can model the world, but since they start out modelling text, that’s what their goals range over, even up to considerably superhuman competence at a wide range of tasks. They don’t want things in the real world, and only model it to get more accurate text predictions. Therefore, you can just ask RLHF’d GPT-10, “what’s the permanent alignment solution?”, and it’ll tell you.
People still sometimes say, “doesn’t this require us to get unreasonably, impossibly lucky with generalisation?” No: it requires luck, but you can’t call it unbelievably improbable luck just based on not knowing how generalisation works. I also think recent evidence (LLMs getting better at modelling the world without developing goals over it) suggests this world is a bit more likely than it seemed years ago, as Paul Christiano argues here:
“I think that a system may not even be able to “want” things in the behaviorist sense, and this is correlated with being unable to solve long-horizon tasks. So if you think that systems can’t want things or solve long horizon tasks at all, then maybe you shouldn’t update at all when they don’t appear to want things.

But that’s not really where we are at—AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question:

Could you give an example of a task you don’t think AI systems will be able to do before they are “want”-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like “go to the moon” and that you will still be writing this kind of post even once AI systems have 10x’d the pace of R&D.)”
But, again, I’m not making the claim that this favourable generalisation that gets RLHF to work is likely, just that it’s not a random complex hypothesis with no evidence for it that’s therefore near-impossible.
Since we don’t know how generalisation works, we can’t even say “we should have a uniform prior over the internal mechanisms I can describe that could get high reward”. Rather, if you don’t know, you really just don’t know, and the mechanism that involves actually learning to be nice in order to predict niceness, or actually staying in the domain you were initially trained on when planning, might be favoured by inductive biases in training.
But even if you disagree with me on that, the supposed mistake is not (just) as simple as literally conflating the intent of the overseers with the goals that the AI learns. Rather, the thought is that replicating the goals that produced the feedback, and simply adopting them as your own, is a natural, simple way to learn to predict what the overseer wants, even up to fairly superhuman capabilities, so it’s what will get learned by default even if it isn’t the globally optimal reward-maximiser. Is this true? I don’t know, but if it’s false it’s at least a more complicated mistake. This point has been made many times in different contexts; there’s a summary discussion here that outlines six different presentations of this basic idea.
If I had to sum it up: while Nora maybe confuses the map with the territory, Eliezer conflates ignorance with positive knowledge (going from ‘we don’t know how generalisation works’ to ‘we should have a strong default uniform prior over every kind of mechanism we could name’).
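To make that disagreement concrete, here’s a toy sketch in code. Everything in it is invented for illustration (the mechanism names, the counts, the weights); it’s only meant to show that how much probability “actually being nice” gets depends entirely on the prior you put over candidate internal mechanisms, which is exactly the thing ignorance doesn’t hand you.

```python
# Toy illustration only: the mechanism names, counts, and weights below are
# invented to show the structure of the argument, not anyone's actual model.

# Reading 1: treat every nameable mechanism that fits the nice-feedback data
# as roughly equally likely a priori. Genuine niceness is a needle in a haystack.
n_mechanisms = 1_000_000  # hypothetical number of distinct reward-fitting mechanisms
p_nice_uniform = 1 / n_mechanisms
print(f"uniform prior: P(actually nice) ~ {p_nice_uniform:.6f}")

# Reading 2: inductive bias privileges a handful of simple, "natural"
# generalisations, one of which is genuinely adopting the overseer's goals.
privileged = {
    "actually nice": 0.2,
    "imitates niceness only in-distribution": 0.4,
    "optimises a reward proxy": 0.4,
}
p_nice_biased = privileged["actually nice"]
print(f"biased prior:  P(actually nice) ~ {p_nice_biased:.1f}")
```

Neither reading is established; the point is just that moving from ‘we don’t know the prior’ to ‘use the uniform one’ is itself a substantive assumption.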
Then there’s Andrew Critch, who I think agrees with and understands the point I’ve just made (that Nora’s argument is not a simple mistake of the map for the territory), but then makes a far more overreaching and unjustifiable claim than Eliezer or Nora in response.
In the Nora/Eliezer case, both were very confident in their respective models of AI generalisation, which is at least the kind of thing you could justifiably be extremely confident about, given strong enough evidence (which I don’t think we have). Social science and futurism are not that kind of thing. Critch says,
“I think literally every human institution will probably fail or become fully dehumanized by sometime around (median) 2040.”
I just don’t buy the “multipolar chaos” prediction: that processes like a fast-proliferating production web will demolish or corrupt all institutional opposition and send us to dystopia with near-certainty.
I’ve read his production web stories and also heard similar arguments from many people, and it’s hard to voice my objections as a specific “here’s why your story can’t happen” rebuttal (in fact, I think many of the stories are at least somewhat plausible), but I still think there’s a major error of reasoning going on here. I think it’s related to the conjunction fallacy, to sleepwalk bias, and possibly to not wanting to come across as unreasonably optimistic about our institutions.
Here’s one of the production web stories in brief, though you can read it in full, along with my old discussion, here:

In the future, AI-driven management assistant software revolutionizes industries by automating decision-making processes, including “soft skills” like conflict resolution. This leads to massive job automation, even at high management levels. Companies that don’t adopt this technology fall behind. An interconnected “production web” of companies emerges, operating with minimal human intervention and focusing on maximizing production. They develop a self-sustaining economy, using digital currencies and operating beyond human regulatory reach. Over time, these companies, driven by their AI-optimized objectives, inadvertently prioritize their production goals over human welfare. This misalignment leads to the depletion of essential resources like arable land and drinking water, ultimately threatening human survival, as humanity becomes unable to influence or stop these autonomous corporate entities.
My object-level response is to say something mundane along the lines of the following. I think each of these is more or less independent and not extremely unlikely to occur (each is above 1% likely; there’s some toy arithmetic after the list):
Wouldn’t governments and regulators also have access to AI systems to aid with oversight and especially with predicting the future? Remember, in this world we have pseudo-aligned AI systems that will more or less do what their overseers want in the short term.
Couldn’t a political candidate ask their (aligned) strategist-AI ‘are we all going to be killed by this process in 20 years?’ and then mount a persuasive campaign to change the public’s mind early in the process, using the obvious evidence to their advantage?
If the world is alarmed by the expanding production web and governments have a lot of hard power initially, why will enforcement necessarily be ineffective? If there’s a shadow economy of digital payments, just arrest anyone found dealing with a rogue AI system. This would scare a lot of people.

We’ve already seen pessimistic views about what AI regulation can achieve be, self-confessedly, falsified at the 98% level; there’s sleepwalk bias to consider. As Stefan Schubert put it: “Yeah, if people think the policy response is ‘99th-percentile-in-2018’, then that suggests their models have been seriously wrong.” So maybe the regulations will be effective, foresightful, and well implemented, with AI systems foreseeing the long-run consequences of decisions and backing them up.
What if the lead project is unitary and becomes a singleton, or the few lead projects quickly band together because they’re foresightful, so none of this race-to-the-bottom stuff happens in the first place?
If it gets to the point where water or the oxygen in the atmosphere is being used up (and why would that happen, when it would be easier for the machines to fly off into space rather than deal with the presumed disvalue of doing something their original overseers didn’t like?), did nobody build in ‘off switches’?
Even if they aren’t fulfilling our values perfectly, wouldn’t the production web just reach some equilibrium where it’s skimming off a small amount of resources to placate its overseers (since its various components are at least somewhat beholden to them) while expanding further and further?
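Here’s the toy arithmetic promised above. The probabilities are made up purely for illustration (I’m not defending any particular number); the point is just that a prediction requiring every one of these outs to fail, if they’re even roughly independent, has trouble reaching near-certainty.

```python
# Toy arithmetic for the conjunction point. The probabilities are invented for
# illustration; the structure of the calculation, not the numbers, is what matters.
outs = {
    "regulators use AI for oversight and forecasting": 0.20,
    "a candidate campaigns successfully on the forecasted risk": 0.10,
    "early hard-power enforcement works": 0.15,
    "regulation is effective, foresightful, and well implemented": 0.10,
    "lead projects coordinate instead of racing": 0.10,
    "off switches / an equilibrium that placates overseers": 0.15,
}

# The near-certain-doom story needs every single out to fail.
p_every_out_fails = 1.0
for p in outs.values():
    p_every_out_fails *= (1 - p)

print(f"P(every listed out fails) ~ {p_every_out_fails:.2f}")  # ~0.42 with these invented numbers
```

With numbers like these, “literally every institution fails by around 2040” is something you might reasonably fear, but not something to hold at near-certainty.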
And I already know the response is just going to be “Moloch wouldn’t let that happen…”, and that eventually competition will mean all of these barriers disappear. At this point, though, I think such a response is too broad and proves too much. If you use the Moloch idea this way, it becomes the classic mistaken “one big idea universal theory of history”, which can explain nearly any outcome so long as it doesn’t have to predict it.
A further point: I think that someone using this kind of reasoning in 1830 would have very confidently predicted that the world of 2023 would be a horrible dystopia where wages for workers hadn’t improved at all, because of Moloch.
I agree that it’s somewhat easier for me to write a realistic science fiction story set in 2045 that’s dystopian rather than utopian, assuming pseudo-aligned AGI and no wars or other obvious catastrophic misuse. As a broader point, I, along with the great majority of people, don’t really want this transition to happen either way, and there are many aspects of the ‘mediocre/utopian’ futures that would be suboptimal, so I get why the future forecasts never look normal or low-risk.
But I think all this speculation tells us very little, with any confidence, about what the default future looks like. I don’t think a dystopian economic race to the bottom is extremely unlikely, and, like Matthew Barnett, I am worried about what values and interests will influence AI development, and I think the case for being concerned about whether our institutions will hold is strong.
But saying that Moloch is a deterministic law of nature, such that we can be near-certain of the outcome, is not justifiable. These are not even the kinds of predictions about which you can have such certainty.
Also, in this case I think that a reference class/outside view objection that this resembles failed doomsday predictions of the past is warranted.
I don’t think these objections carry much weight when the concern is misaligned AI takeover, as that has a clear, singular, obvious mechanism to worry about.
However, the ‘Molochian race to the bottom’ multipolar chaos story does have the characteristic of ignoring or dismissing endogenous responses (society seeing what’s happening and deciding not to go down that path) and unknown unknowns, which is exactly what we saw with past failed doomsday predictions. I see this as squarely in the same reference class as the people who in past decades were certain of overpopulation catastrophes, or the people now who are certain of, or think likely, civilizational collapse from the effects of climate change. It’s taking current trends and drawing mental straight lines on them up to extreme heights decades in the future.