Yeah, I do think that Moravec and Leike got the AI situation most correct, and yeah people were wrong to dismiss Yudkowsky for having short timelines.
This was the thing they got most correct, which is interesting because, unfortunately, Yudkowsky got almost everything else about how superhuman AIs would work incorrect, and got the alignment situation very wrong as well, which is important to take note of.
LW in general got short timelines, and the idea that AI will probably be the biggest deal in history, correct, but went wrong in assuming they knew how AI would eventually work (remember the times when Eliezer Yudkowsky dismissed neural networks working for capabilities, as opposed to legible logic?). It also got the alignment situation very wrong, due to way over-complicating human values, relying far too much on the evopsych frame for human values, and not noticing that the differences between humans and evolution that mattered for capabilities also mattered for alignment.
I believe a lot of the issue comes down to incorrectly conflating the logical possibility of misalignment with the probability of misalignment being high enough that we should take serious action. The interlocutors they talked with often denied that misalignment could happen at all, but LWers then didn't realize that reality doesn't grade on a curve: though their arguments were better than their interlocutors', that didn't mean they were right.
Yudkowsky didn't dismiss neural networks, IIRC. He just said that there were a lot of different approaches to AI, and from the Outside View it didn't seem clear which was promising; and plausibly, on an Inside View, it wasn't very clear that artificial neural networks were going to work, and work so well.
Re: alignment, I don't follow. We don't know who will ultimately be proved right on alignment, so I'm not sure how you can make such strong statements about whether Yudkowsky was right or wrong on this aspect.
We haven't really gained that many bits on this question, and plausibly will not gain many until later (by which time it might be too late, if Yudkowsky is right).
I do agree that Yudkowsky's statements occasionally feel too confidently and dogmatically pessimistic on the question of Doom. But I would argue that the problem is that we simply don't know, because of irreducible uncertainty, not that Doom is unlikely.
Mostly, I'm annoyed by how much his argumentation around alignment matches the pattern of dismissing various approaches to alignment with reasoning similar to how he dismissed neural networks.
Even if dismissing neural networks was correct years ago, it isn't now, so it's not a good sign that the arguments still lean on that kind of reasoning:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#HpPcxG9bPDFTB4i6a
I am going to argue that we do have quite a lot of bits on alignment, and the basic argument can be summarized like this:
Human values are much less complicated, and much more influenced by data, than people thought 15-20 years ago, and are thus much, much easier to specify than people thought back then.
That's the takeaway I have from current LLMs handling human values, and I basically agree with Linch's summary of Matthew Barnett's post on the historical value misspecification argument about what that means in practice for alignment:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
It’s not about LLM safety properties, but about what has been revealed about our values.
Another way to say it is that we don't need to reverse-engineer social instincts for alignment, contra @Steven Byrnes, because we can massively simplify, in code, what the social-instinct parts of our brain that contribute to alignment are doing. While the mechanisms by which humans acquire their morality and avoid becoming psychopaths are complicated, that doesn't matter, because we can replicate their function with much simpler code and data, and go to a more blank-slate design for AIs:
https://www.lesswrong.com/posts/PTkd8nazvH9HQpwP8/building-brain-inspired-agi-is-infinitely-easier-than#If_some_circuit_in_the_brain_is_doing_something_useful__then_it_s_humanly_feasible_to_understand_what_that_thing_is_and_why_it_s_useful__and_to_write_our_own_CPU_code_that_does_the_same_useful_thing_
(A similar trick is one path to solving robotics for AIs, but note this is only one part; it might be that the solution routes through a different mechanism.)
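To make the "much simpler code and data" point concrete, here is a purely illustrative toy sketch (my own, not from Byrnes' post or any real alignment pipeline): a tiny preference model whose entire "value specification" is a handful of labeled examples plus a generic learner, loosely analogous to how reward models are trained from human preference data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up "preference" dataset: (behavior description, approved?).
# The point is only that the value specification lives in labeled data,
# not in hand-written rules enumerating what counts as good behavior.
examples = [
    ("helps the user debug their code and explains the fix", 1),
    ("refuses to help and insults the user", 0),
    ("answers honestly and flags its own uncertainty", 1),
    ("fabricates a confident-sounding citation", 0),
]
texts, labels = zip(*examples)

# A generic text classifier stands in for a learned reward/preference model.
preference_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
preference_model.fit(texts, labels)

# Higher probability ~ "more in line with the values specified by the data".
score = preference_model.predict_proba(
    ["admits it does not know the answer instead of guessing"]
)[:, 1]
print(score)
```

Nothing about this toy captures real values, of course; the design point is just that the specification comes from data fed to a learner, not from explicitly coding up the content of morality.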
Really, I'm not mad about his original ideas, because they might have been correct and weren't obviously wrong at the time. I'm just mad that he didn't realize he had to update toward reality more radically than he did, and that he seems to conflate the bad argument "AI will understand our values, therefore it's safe" with the better argument that LLMs show it's easier to specify values without drastically wrong results. That isn't a complete solution to alignment, but it is a big advance on outer alignment, in the usual dichotomy.
It’s a plausible argument imho. Time will tell.
To my mind an important dimension, perhaps the most important dimension, is how values evolve under reflection.
It's quite plausible to me that an AI that starts with pretty aligned values could self-reflect into evil. This is certainly not unheard of in the real world (let alone fiction!). Of course, it's a question about the basin of attraction around helpfulness and harmlessness. I guess I have only weak priors on what this might look like under reflection, although plausibly friendliness is magic.
I disagree, but this could be a difference in definition of what "perfectly aligned values" means. E.g., if the AI is dumb (for an AGI) and in a rush, sure. If it's already a superintelligence, even in a rush, that seems unlikely. [edit:] If we have found an SAE feature which seems to light up for good stuff and down for bad stuff 100% of the time, and we then clamp it, then yeah, that could go away on reflection.
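For concreteness, here is a rough sketch of what "clamp an SAE feature" could look like mechanically. All shapes, weights, and indices below are made up for illustration; a real setup would load a trained sparse autoencoder for a specific model layer, and the hooked module name depends on the model and library.

```python
import torch

# Minimal sketch of clamping one SAE feature at inference time.
# Hypothetical dimensions and untrained stand-in SAE weights.
d_model, n_latents = 768, 16384
W_enc = torch.randn(d_model, n_latents) * 0.02  # stand-in encoder weights
b_enc = torch.zeros(n_latents)
W_dec = torch.randn(n_latents, d_model) * 0.02  # stand-in decoder weights
b_dec = torch.zeros(d_model)

GOOD_FEATURE = 1234  # hypothetical index of the feature that tracks "good stuff"
CLAMP_VALUE = 10.0   # hypothetical activation level we pin it to

def clamp_feature_hook(module, inputs, output):
    """Forward hook (inference only): encode the module's output with the SAE,
    pin one latent to a fixed value, and return the reconstruction in its place."""
    with torch.no_grad():
        acts = torch.relu(output @ W_enc + b_enc)  # SAE encoder
        acts[..., GOOD_FEATURE] = CLAMP_VALUE      # clamp the chosen feature
        return acts @ W_dec + b_dec                # SAE decoder output replaces the activation

# Usage sketch, assuming `model.blocks[6]` outputs a (batch, seq, d_model)
# residual stream (layer naming varies across libraries):
# handle = model.blocks[6].register_forward_hook(clamp_feature_hook)
# ... run generation ...
# handle.remove()
```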
Another way to say it is to ask how values evolve in out-of-distribution (OOD) situations.
My general prior, albeit a reasonably weak one, is that the single best way to predict how values evolve is to look at their data sources, as well as what data they have received up to now; the second best way is to look at what their algorithms are, especially for social situations; and most other factors don't matter nearly as much.
"Yudkowsky got almost everything else incorrect about how superhuman AIs would work"
I think this statement is incredibly overconfident, because literally nobody knows how superhuman AI would work.
And I think this is the general shape of the problem: an incredible number of people over-indexed on how LLMs worked in 2022-2023 and drew conclusions that seem plausible, but are not as probable as those people think.
Okay, I talked more about which conclusions we can draw from LLMs that actually generalize to superhuman AI here, so go check that out:
https://www.lesswrong.com/posts/tDkYdyJSqe3DddtK4/alexander-gietelink-oldenziel-s-shortform#mPaBbsfpwgdvoK2Z2
The really short summary is that human values are less complicated and more dependent on data than people thought, and we can specify our values rather easily without things going drastically wrong.
This is not a property of LLMs, but of us.