takes a deep breath
(Epistemic status: vague, ill-formed first impressions.)
So that’s what we’re doing, huh? I suppose EY/MIRI has reached the point where worrying about memetics / optics has become largely a non-concern, in favor of BROADCASTING TO THE WORLD JUST HOW FUCKED WE ARE
I have… complicated thoughts about this. My object-level read of the likely consequences is that I have no idea what the object-level consequences are likely to be, other than that this basically seems to be an attempt at heaving a gigantic rock through the Overton window, for good or for ill. (Maybe AI alignment becomes politicized as a result of this? But perhaps it already has been! And even if not, maybe politicizing it will at least raise awareness, so that it might become a cause area with similar notoriety as e.g. global warming—which appears to have at least succeeded in making token efforts to reduce greenhouse emissions?)
I just don’t know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.
That’s not how I read it. To me it’s an attempt at the simple, obvious strategy of telling people ~all the truth he can about a subject they care a lot about and where he and they have common interests. This doesn’t seem like an attempt to be clever or explore high-variance tails. More like an attempt to explore the obvious strategy, or to follow the obvious bits of common-sense ethics, now that lots of allegedly clever 4-dimensional chess has turned out stupid.
I don't think what you say, Anna, contradicts what dxu said. The obvious simple strategy is now being tried because the galaxy-brained strategies don't seem like they are working; the galaxy-brained strategies seemed lower-variance and more sensible in general at the time, but now they seem less sensible, so EY is switching to the higher-variance, less-galaxy-brained strategy.
But it does risk giving up something. Even the average tech person on a forum like Hacker News still thinks the risk of an AI apocalypse is so remote that only a crackpot would take it seriously. Their priors regarding the idea that anyone of sense could take it seriously are so low that any mention of safety seems to them a fig-leaf excuse to monopolize control for financial gain; about as believable as Putin's claims that he's liberating Ukraine from Nazis. (See my recent attempt to introduce the idea here.) The average person on the street is even further away from this, I think.
The risk then of giving up “optics” is that you lose whatever influence you may have had entirely; you’re labelled a crackpot and nobody takes you seriously. You also risk damaging the influence of other people who are trying to be more conservative. (NB I’m not saying this will happen, but it’s a risk you have to consider.)
For instance, personally I think the reason so few people take AI alignment seriously is that we haven't actually seen anything all that scary yet. If there were demonstrations of GPT-4, in simulation, murdering people due to misalignment, then this sort of pause would be a much easier sell. Going full-bore "international treaty to control access to GPUs" now introduces the risk that, when GPT-6 is shown to murder people due to misalignment, people take it less seriously, because they've already decided AI alignment people are all crackpots.
I think the chances of an international treaty to control GPUs at this point are basically zero. I think our best bet for actually getting people to take an AI apocalypse seriously is to demonstrate an unaligned system harming people (hopefully only in simulation), in a way that people can immediately see could extend to destroying the whole human race if the AI were more capable. (It would also give all those AI researchers something more concrete to do: figure out how to prevent this AI from doing this sort of thing; figure out other ways to get this AI to do something destructive.) Arguing to slow down AI research for other reasons—for instance, to allow society to adapt to the changes we’ve already seen—will give people more time to develop techniques for probing (and perhaps demonstrating) catastrophic alignment failures.
“For instance, personally I think the reason so few people take AI alignment seriously is that we haven’t actually seen anything all that scary yet.”
And if this “actually scary” thing happens, people will know that Yudkowsky wrote the article beforehand, and they will know who the people are that mocked it.
The average person on the street is even further away from this, I think.
This contradicts the existing polls, which appear to say that everyone outside of your subculture is much more concerned about AGI killing everyone. It looks like if it came to a vote, delaying AGI in some vague way would win by a landslide, and even Eliezer’s proposal might win easily.
Can you give a reference? A quick Google search didn’t turn anything like that up.
Here’s some more:
https://www.monmouth.edu/polling-institute/reports/monmouthpoll_us_021523/
A majority (55%) of Americans are now worried at least somewhat that artificially intelligent machines could one day pose a risk to the human race's existence. This marks a reversal from Monmouth's 2015 poll, when a smaller number (44%) was worried and a majority (55%) was not.
A SurveyMonkey poll on AI conducted for USA TODAY also had overtones of concern, with 73% of respondents saying they would prefer that AI be limited in the rollout of newer tech so that it doesn't become a threat to humans.
Meanwhile, 43% said smarter-than-human AI would do "more harm than good," while 38% said it would result in "equal amounts of harm and good."
I'll look for the one that asked about the threat to humanity and broke down responses by race and gender. In the meantime, here's a poll showing general unease and bipartisan willingness to legally restrict the use of AI: https://web.archive.org/web/20180109060531/http://www.pewinternet.org/2017/10/04/automation-in-everyday-life/
Plus:
I do note, on the other side, that the general public seems more willing to go Penrose, sometimes expressing or implying a belief in quantum consciousness unprompted. That part is just my own impression.
This may be what I was thinking of, though the data is more ambiguous or self-contradictory: https://www.vox.com/future-perfect/2019/1/9/18174081/fhi-govai-ai-safety-american-public-worried-ai-catastrophe
Thanks for these, I'll take a look. After your challenge, I tried to think of where my impression came from. I've had a number of conversations with relatives on Facebook (including my aunt, who is in her 60s) about whether GPT "knows" things; but it turns out so far I've only had one conversation about the potential of an AI apocalypse (with my sister, who started programming 5 years ago). So I'll reduce confidence in my assessment re what "people on the street" think, and try to look for more information.
Re HackerNews—one of the tricky things about “taking the temperature” on a forum like that is that you only see the people who post, not the people who are only reading; and unlike here, you only see the scores for your own comments, not those of others. It seems like what I said about alignment did make some connection, based on the up-votes I got; I have no idea how many upvotes the dissenters got, so I have no idea if lots of people agreed with them, or if they were the handful of lone objectors in a sea of people who agreed with me.
I second this.
I think people really get used to discussing things in their research labs or in specific online communities. And then, when they try to interact with the real world and even do politics, they kind of forget how different the real world is.
Simply telling people ~all the truth may work well in some settings (although it’s far from all that matters in any setting) but almost never works well in politics. Sad but true.
I think that Eliezer (and many others including myself!) may be susceptible to "living in the should-universe" (as named by Eliezer himself).
I do not necessarily say that this particular TIME article was a bad idea, but I feel that people who communicate about x-risk are on average biased in this way. And it may greatly hinder the results of communication.
I also mostly agree with "people don't take AI alignment seriously because we haven't actually seen anything all that scary yet". However, I think that the scary thing is not necessarily "simulated murders". For example, a lot of people are quite concerned about unemployment caused by AI. I believe it might change perception significantly if it actually turns out to be a big problem, which seems plausible.
Yes, of course, it is a completely different issue. But on an emotional level, it will be similar (AI == bad stuff happening).
People like Ezra Klein are hearing Eliezer and rolling his position into their own more palatable takes. I really don't think it's necessary for everyone to play that game; it seems really good to have someone out there just speaking honestly, even if they're far on the pessimistic tail, so others can see what's possible. 4D chess here seems likely to fail.
https://steno.ai/the-ezra-klein-show/my-view-on-ai
Also, there's the sentiment going around that normies who hear this are actually way more open to the simple AI Safety case than you'd expect; we've been extrapolating too much from current critics. Tech people have had years to formulate rationalizations and reassure one another they are clever skeptics for dismissing this stuff. Meanwhile, regular folks will often spout off casual proclamations that the world is likely ending due to climate change or social decay or whatever; they seem to err on the side of doomerism as often as the opposite. The fact that Eliezer got published in TIME is already a huge point in favor of his strategy working.
EDIT: Case in point! Met a person tonight, a completely offline, rural, anti-vax, astrology, doesn't-follow-the-news type of person. I said the word "AI" and she immediately said she thinks "robots will eventually take over". I understand this might not be the level of sophistication we'd desire, but at least be aware that raw material is out there. No idea how it'll play out, but 4D chess still seems like a mistake; let Yud speak his truth.
Meanwhile, regular folks will often spout off casual proclamations that the world is likely ending due to climate change or social decay or whatever; they seem to err on the side of doomerism as often as the opposite.
This is not a good thing, under my model, given that I don't agree with doomerism.
You disagree with doomerism as a mindset, or factual likelihood? Or both?
I think doomerism as a mindset isn't great, but in terms of likelihood, there are ~3 things likely to kill humanity atm, AI being the first.
Both as a mindset and as a factual likelihood.
For mindset, I agree that doomerism isn't good, primarily because it can close your mind off from real solutions to a problem, and make you over-update toward the overly pessimistic view.
As a factual statement, I also disagree with high p(Doom) probabilities, and I have a maximum of 10%, if not lower.
As for object-level arguments for why I disagree with the doom take, here they are:
I disagree with the assumption of Yudkowskians that certain abstractions just don't scale well when we crank them up in capabilities. I remember a post that did interpretability on AlphaZero and found that it has essentially human-interpretable abstractions, which at least for the case of Go disproved that Yudkowskian notion.
I am quite a bit more optimistic about scalable alignment than many in the LW community, and recent work showed that as an AI got more data, it got more aligned with human goals. There are many other benefits in the recent work, but the fact that it showed that as a certain capability scaled up, alignment scaled up, means that the trend of alignment is positive, and more capable models will probably be more aligned.
Finally, trend lines. There's a saying inspired by the Atomic Habits book: the trend line matters more than how much progress you make in a single sitting. And in the case of alignment, that trend line is positive but slow, which means we are in an extremely good position to speed up that trend. It also means we should be far less worried about doom, as we just have to increase the trend line of alignment progress and wait.
Edit: My first point is, at best, partially correct, and may need to be removed altogether due to a new paper called Adversarial Policies Beat Superhuman Go AIs.
Link below:
https://arxiv.org/abs/2211.00241
All other points stand.
The recent Adversarial Policies Beat Superhuman Go AIs paper seems to cast doubt on how well those abstractions generalize in the case of Go.
I’ll admit, that is a fairly big blow to my first point, though the rest of my points stand. I’ll edit the comment to mention your debunking of my first point.
I think that a mindset considered ‘poor’ would imply that it causes one to arrive at false conclusions more often.
If doomerism isn’t a good mindset, it should also—besides making one simply depressed and fearful / pessimistic about the future—be contradicted by empirical data, and the flow of events throughout time.
Personally, I think it's pretty easy to show that pessimism (belief that certain objectives are impossible or doomed to cause catastrophic, unrecoverable failure) is wrong. Furthermore, and even more easily argued, is that believing one's objective is unlikely or impossible cannot make one more likely to achieve it. I would define 'poor' mindsets to be equivalent to the latter to some significant degree.
I think that Eliezer (and many others including myself!) may be susceptible to "living in the should-universe"
That's a new one!
More seriously: Yep, it’s possible to be making this error on a particular dimension, even if you’re a pessimist on some other dimensions. My current guess would be that Eliezer isn’t making that mistake here, though.
For one thing, the situation is more like “Eliezer thinks he tried the option you’re proposing for a long time and it didn’t work, so now he’s trying something different” (and he’s observed many others trying other things and also failing), rather than “it’s never occurred to Eliezer that LWers are different from non-LWers”.
I think it’s totally possible that Eliezer and I are missing important facts about an important demographic, but from your description I think you’re misunderstanding the TIME article as more naive and less based-on-an-underlying-complicated-model than is actually the case.
I specifically said “I do not necessarily say that this particular TIME article was a bad idea” mainly because I assumed it probably wasn’t that naive. Sorry I didn’t make it clear enough.
I still decided to comment because I think this is pretty important in general, even if somewhat obvious. It looks like one of those biases which show up over and over again even if you try pretty hard to correct them.
Also, I think it’s pretty hard to judge what works and what doesn’t. The vibe has shifted a lot even in the last 6 months. I think it is plausible it shifted more than in a 10-year period 2010-2019.
For one thing, the situation is more like "Eliezer thinks he tried the option you're proposing for a long time and it didn't work, so now he's trying something different"
I think this is the big disagreement I have. I do think the alignment community is working, and in general I think the trend of alignment is positive. We haven't solved the problems, but we're quite a bit closer to the solution than 10 years ago.
The only question was whether LW and the intentional creation of an alignment community were necessary, or whether the alignment problem was going to be solved without intentionally creating LW and a field of alignment research.
I mean, I could agree with those two claims but think the trendlines suggest we’ll have alignment solved in 200 years and superintelligent capabilities in 14 years. I guess it depends on what you mean by “quite a bit closer”; I think we’ve written up some useful semiformal descriptions of some important high-level aspects of the problem (like ‘Risks from Learned Optimization’), but this seems very far from ‘the central difficulties look 10% more solved now’, and solving 10% of the problem in 10 years is not enough!
(Of course, progress can be nonlinear—the last ten years were quite slow IMO, but that doesn’t mean the next ten years must be similarly slow. But that’s a different argument for optimism than ‘naively extrapolating the trendline suggests we’ll solve this in time’.)
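To make the arithmetic behind that explicit, here is a back-of-envelope sketch in Python using only the illustrative numbers above (roughly 10% of the problem per decade, superintelligent capabilities in 14 years); these are the hypothetical's numbers, not a forecast:

    # Naive linear extrapolation of the alignment trendline, using the
    # illustrative numbers from the comment above (not a forecast).
    progress_so_far = 0.10            # "the central difficulties look 10% more solved"
    years_elapsed = 10                # ...after roughly ten years of work
    rate = progress_so_far / years_elapsed              # fraction of the problem per year
    years_to_finish = (1.0 - progress_so_far) / rate    # years remaining at this pace
    years_to_superintelligence = 14                     # the hypothetical capabilities timeline
    print(years_to_finish, years_to_superintelligence)  # 90.0 vs 14

On a straight-line extrapolation, alignment finishes many decades after capabilities arrive, which is why a positive-but-slow trend alone isn't reassuring.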
I disagree, though you’re right that my initial arguments weren’t enough.
To talk about the alignment progress we’ve achieved so far, here’s a list:
We finally managed to solve the problem of deceptive alignment while being capabilities competitive. In particular, we figured out a goal that is both more outer aligned than the Maximum Likelihood Estimation goal that LLMs use, and critically it is a myopic goal, meaning we can avoid deceptive alignment even at arbitrarily high capabilities.
The more data we give to the AI, the more aligned the AI is, which is huge in the sense that we can reliably get AI to be more aligned as it’s more capable, vindicating the scalable alignment agenda.
The training method doesn't allow the AI to affect its own distribution (unlike online learning, where the AI selects the data points it learns from), and thus it can't shift the distribution or gradient hack.
As far as how much progress? I'd say this is probably 50-70% of the way there, primarily because we are finally figuring out ways to deal with core problems of alignment like deceptive alignment or outer alignment of goals without too much alignment tax.
“We finally managed to solve the problem of deceptive alignment while being capabilities competitive”
??????
Good question to ask, and I’ll explain.
So one of the prerequisites of deceptive alignment is that the model optimizes for non-myopic goals, that is, goals that are about the long term.
So in order to avoid deceptive alignment, one must find a goal that is both myopic and, ideally, scales to arbitrary capabilities.
And in a sense, that's what Pretraining from Human Feedback found, in that the goal of cross-entropy on a feedback-annotated webtext distribution is a myopic goal, and it's either on the capabilities frontier or outright the optimal goal for AIs. In particular, it has a much lower alignment tax than other schemes.
In essence, the goal avoids deceptive alignment by removing one of the prerequisites of deceptive alignment. At the very least, it doesn't incentivize deceptive alignment.
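For concreteness, here is a minimal sketch (Python, using the Hugging Face transformers library) of the kind of conditional-training objective Pretraining from Human Feedback describes: documents are scored by some feedback signal, tagged with a control token, and then trained on with ordinary next-token cross-entropy. The feedback_score stub, the control tokens, and the threshold are illustrative placeholders, not the paper's actual implementation:

    # Illustrative sketch of conditional training in the spirit of
    # "Pretraining from Human Feedback"; scorer, tags, and threshold are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer.add_special_tokens({"additional_special_tokens": ["<|good|>", "<|bad|>"]})
    model.resize_token_embeddings(len(tokenizer))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    def feedback_score(text: str) -> float:
        # Hypothetical stand-in for a reward model or rule-based classifier.
        return 0.0 if "offensive" in text else 1.0

    def tag_document(text: str, threshold: float = 0.5) -> str:
        # Prepend a control token based on the feedback score; training on the
        # tagged text with plain cross-entropy stays a per-token objective.
        tag = "<|good|>" if feedback_score(text) >= threshold else "<|bad|>"
        return tag + text

    webtext = ["an ordinary document", "an offensive document"]
    for doc in webtext:
        batch = tokenizer(tag_document(doc), return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss  # standard LM cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # At sampling time, one conditions on "<|good|>" to draw from the
    # feedback-approved part of the learned distribution.

(Whether a myopic training objective like this also guarantees myopic cognition is exactly what the replies below dispute.)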
You seem to be conflating myopic training with myopic cognition.
Myopic training is not sufficient to ensure myopic cognition.
I think you'll find near-universal agreement among alignment researchers that deceptive alignment hasn't been solved. (I'd say "universal" if I weren't worried about true Scotsmen.)
I do think you’ll find agreement that there are approaches where deceptive alignment seems less likely (here I note that 99% is less likely than 99.999%). This is a case Evan makes in the conditioning predictive models approach.
However, the case there isn’t that the training goal is myopic, but rather that it’s simple, so it’s a little more plausible that a model doing the ‘right’ thing is found by a training process before a model that’s deceptively aligned.
I agree that this is better than nothing, but “We finally managed to solve the problem of deceptive alignment...” is just false.
I agree, which is why I retracted my comments about deceptive alignment being solved, though I do think it’s still far better to not have incentives to be non-myopic than to have such incentives in play.
It does help in some respects.
On the other hand, a system without any non-myopic goals also will not help to prevent catastrophic side-effects. If a system were intent-aligned at the top level, we could trust that it’d have the motivation to ensure any of its internal processes were sufficiently aligned, and that its output wouldn’t cause catastrophe (e.g. it wouldn’t give us a correct answer/prediction containing information it knew would be extremely harmful).
If a system only does myopic prediction, then we have to manually ensure that nothing of this kind occurs—no misaligned subsystems, no misaligned agents created, no correct-but-catastrophic outputs....
I still think it makes sense to explore in this direction, but it seems to be in the category [temporary hack that might work long enough to help us do alignment work, if we’re careful] rather than [early version of scalable alignment solution]. (though a principled hack, as hacks go)
To relate this to your initial point about progress on the overall problem, this doesn’t seem to be much evidence that we’re making progress—just that we might be closer to building a tool that may help us make progress.
That’s still great—only it doesn’t tell us much about the difficulty of the real problem.
It seems to be historically the case that "doomers" or "near-doomers" (public figures who espouse pessimistic views of the future, often with calls for collective drastic actions) do not always come out with a positive public perception when doom or near-doom is perceived not to have occurred, or to have occurred far from what was predicted.
Doomers seem to have a trajectory rather than a distribution, per se. From my perspective, this is on-trajectory. He believed doom was possible, now he believes it is probable.
I’m not sure how long it will be until we get past the “doom didn’t happen” point. Assuming he exists in the future, Eliezer_future lives in the world in which he was wrong. It’s not obvious to me that Eliezer_future exists with more probability the more Eliezer_current believes Eliezer_future doesn’t exist.
I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.
Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen here. It's not clear to me why Eliezer saying this would make anything harder for other policy proposals. (Not that I agree with everything he said, I just think it was good that he said it.)
I am much more conflicted about the FLI letter; its particular policy prescription seems not great to me, and I worry it makes us look pretty bad if we try approximately the same thing again with a better policy prescription after this one fails, which is approximately what I expect we'll need to do.
(Though to be fair this is as someone who’s also very much on the pessimistic side and so tends to like variance.)
It would’ve been even better for this to happen long before the year of the prediction mentioned in this old blog-post, but this is better than nothing.
I think this is probably right. When all hope is gone, try just telling people the truth and see what happens. I don’t expect it will work, I don’t expect Eliezer expects it to work, but it may be our last chance to stop it.