Yudkowsky didn't dismiss neural networks, IIRC. He just said that there were a lot of different approaches to AI, and from the Outside View it didn't seem clear which was promising; plausibly, on an Inside View, it wasn't very clear that artificial neural networks were going to work, and work so well.
Re: alignment, I don't follow. We don't know who will ultimately be proved right on alignment, so I'm not sure how you can make such strong statements about whether Yudkowsky was right or wrong on this aspect.
We haven't really gained that many bits on this question, and plausibly we won't gain many until later (by which time it might be too late, if Yudkowsky is right).
I do agree that Yudkowsky's statements occasionally feel too confidently and dogmatically pessimistic on the question of Doom. But I would argue that the problem is that we simply don't know, because of irreducible uncertainty, not that Doom is unlikely.
Mostly, I'm annoyed by how much his argumentation around alignment matches the pattern of dismissing various approaches to alignment with reasoning similar to how he dismissed neural networks:
Even if it was correct to dismiss neural networks years ago, it isn't correct now, so it's not a good sign that his arguments still rely on this issue:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#HpPcxG9bPDFTB4i6a
I am going to argue that we do have quite a lot of bits on alignment, and the basic argument can be summarized like this:
Human values are much less complicated, and much more influenced by data, than people thought 15-20 years ago, and they are thus much, much easier to specify than people expected back then.
That's my takeaway from how current LLMs handle human values, and I basically agree with Linch's summary of Matthew Barnett's post on the historical value misspecification argument about what this means in practice for alignment:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
It’s not about LLM safety properties, but about what has been revealed about our values.
Another way to say it is that, contra @Steven Byrnes, we don't need to reverse-engineer social instincts for alignment, because we can massively simplify in code what the social-instinct parts of our brain that contribute to alignment are doing. The mechanisms by which humans acquire their morality and avoid becoming psychopaths are complicated, but that doesn't matter: we can replicate their function with much simpler code and data, and go to a more blank-slate design for AIs:
https://www.lesswrong.com/posts/PTkd8nazvH9HQpwP8/building-brain-inspired-agi-is-infinitely-easier-than#If_some_circuit_in_the_brain_is_doing_something_useful__then_it_s_humanly_feasible_to_understand_what_that_thing_is_and_why_it_s_useful__and_to_write_our_own_CPU_code_that_does_the_same_useful_thing_
(A similar trick is one path to solving robotics for AIs, but note that this is only one part; it might be that the solution routes through a different mechanism.)
Really, I'm not mad about his original ideas; they might have been correct, and they weren't obviously wrong at the time. I'm mad that he didn't realize he had to update to reality more radically than he did, and that he seems to conflate the bad argument "AI will understand our values, therefore it's safe" with the better argument that LLMs show it's easier than expected to specify values without drastically wrong results. That isn't a complete solution to alignment, but it is a big advance on outer alignment in the usual dichotomy.
It’s a plausible argument imho. Time will tell.
To my mind an important dimension, perhaps the most important dimension, is how values evolve under reflection.
It's quite plausible to me that an AI that starts out with pretty aligned values will self-reflect into evil. This is certainly not unheard of in the real world (let alone in fiction!). Of course, it's a question about the basin of attraction around helpfulness and harmlessness. I guess I have only weak priors on what this might look like under reflection, although plausibly friendliness is magic.
I disagree, but that could be a difference in the definition of what "perfectly aligned values" means. E.g., if the AI is dumb (for an AGI) and in a rush, sure. If it's a superintelligence already, even in a rush, that seems unlikely. [edit:] If we have found an SAE feature which seems to light up for good stuff and down for bad stuff 100% of the time, and then we clamp it, then yeah, that could go away on reflection.
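For concreteness, here is a minimal sketch of what "clamping" an SAE feature could look like. Everything in it is hypothetical: the toy SAE, the layer sizes, the feature index, and the clamp value are illustrative stand-ins, not a real trained autoencoder or a real "goodness" feature.

```python
# Minimal sketch (hypothetical): pin one sparse-autoencoder feature to a fixed value
# before decoding back into the model's activation space.
import torch
import torch.nn as nn

class ToySAE(nn.Module):
    """A toy sparse autoencoder over a residual-stream activation."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

def clamp_feature(sae: ToySAE, x: torch.Tensor, feature_idx: int, value: float) -> torch.Tensor:
    """Encode an activation, override one feature's value, and decode back."""
    features = sae.encode(x)
    features[..., feature_idx] = value  # the "clamp": fix this feature regardless of what the model computed
    return sae.decode(features)

# Usage: pretend this is a hidden activation from one layer of an LLM.
sae = ToySAE(d_model=64, d_features=512)
activation = torch.randn(1, 64)
steered = clamp_feature(sae, activation, feature_idx=123, value=5.0)
print(steered.shape)  # torch.Size([1, 64]); the steered activation would be fed back into the model
```

The worry in the comment above is that an intervention like this holds the behavior in place only mechanically; a model reflecting on its own values could, in principle, end up routing around it.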
Another way to say it: the question is how values evolve in OOD situations.
My general prior, albeit a reasonably weak one, is that the single best way to predict how values evolve is to look at their data sources, as well as what data they have received up to now; the second best way is to look at what their algorithms are, especially for social situations; and most of the other factors don't matter nearly as much.