Mostly, I’m annoyed by how much his argumentation around alignment follows the pattern of dismissing various approaches to alignment with reasoning similar to how he once dismissed neural networks. Even if dismissing neural networks was correct years ago, it isn’t now, so it’s not a good sign that his arguments still rely on that move:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#HpPcxG9bPDFTB4i6a
I am going to argue that we do have quite a lot of bits of evidence on alignment, and the basic argument can be summarized like this:
Human values are much less complicated, and much more influenced by data, than people thought 15-20 years ago, and thus much, much easier to specify than people thought back then.
That’s the takeaway I have from how current LLMs handle human values, and I basically agree with Linch’s summary of Matthew Barnett’s post on the historical value misspecification argument about what that means in practice for alignment:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
It’s not about LLM safety properties, but about what has been revealed about our values.
Another way to say it is that we don’t need to reverse-engineer social instincts for alignment, contra @Steven Byrnes, because we can massively simplify, in code, what the social-instinct parts of our brain contribute to alignment. While the mechanisms by which humans acquire morality and avoid becoming psychopaths are complicated, that doesn’t matter, because we can replicate their function with much simpler code and data, and go to a more blank-slate design for AIs (a toy sketch of the “simpler code and data” point follows below):
https://www.lesswrong.com/posts/PTkd8nazvH9HQpwP8/building-brain-inspired-agi-is-infinitely-easier-than#If_some_circuit_in_the_brain_is_doing_something_useful__then_it_s_humanly_feasible_to_understand_what_that_thing_is_and_why_it_s_useful__and_to_write_our_own_CPU_code_that_does_the_same_useful_thing_
(A similar trick is one path to solving robotics for AIs, though note this is only one part; the solution might route through a different mechanism.)
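To make the “simpler code and data” claim above concrete, here is a toy sketch under an illustrative assumption: that “data” means pairwise preference labels and “simpler code” means a small reward model fit to them (roughly the RLHF-style recipe), rather than hand-coded social instincts. The architecture, dimensions, and random stand-in data are placeholders, not anyone’s actual setup.

```python
# Toy sketch: fit a small reward model to pairwise preference data instead of
# hand-coding social instincts. All dimensions and data here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores an embedding; higher = more in line with labeled preferences."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the preferred completion should score higher.
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Random stand-in embeddings for (chosen, rejected) pairs; real data would be
# human preference labels over model outputs.
dim = 512
model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(32, dim), torch.randn(32, dim)

opt.zero_grad()
loss = preference_loss(model, chosen, rejected)
loss.backward()
opt.step()
```

The point of the sketch is only that the “code” side is short and generic; all the content about values comes in through the data.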
Really, I’m not mad about his original ideas, because they might have been correct and weren’t obviously wrong at the time. I’m mad that he didn’t realize he had to update to reality more radically than he did, and that he seems to conflate the bad argument “AI will understand our values, therefore it’s safe” with the better argument that LLMs show it’s easier to specify values without drastically wrong results. That’s not a complete solution to alignment, but it is a big advance on outer alignment in the usual dichotomy.
It’s a plausible argument imho. Time will tell.
To my mind an important dimension, perhaps the most important dimension, is how values evolve under reflection.
It’s quite plausible to me that an AI which starts with pretty aligned values will self-reflect into evil. This is certainly not unheard of in the real world (let alone fiction!). Of course, it’s a question about the basin of attraction around helpfulness and harmlessness. I guess I have only weak priors on what this might look like under reflection, although plausibly friendliness is magic.
I disagree, but it could be a difference in the definition of what “perfectly aligned values” means. E.g. if the AI is dumb (for an AGI) and in a rush, sure. If it’s a superintelligence already, even in a rush, it seems unlikely. [edit:] If we have found an SAE feature which seems to light up for good stuff, and down for bad stuff, 100% of the time, and we clamp it, then yeah, that could go away on reflection.
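For readers unfamiliar with the clamping idea in the edit above, here is a minimal sketch of what “clamp an SAE feature” means mechanically: encode an activation with a sparse autoencoder, pin the chosen latent to a fixed value, and decode back. The dimensions, feature index, and hook point are hypothetical placeholders, not a real model or library API.

```python
# Minimal sketch of clamping one sparse-autoencoder (SAE) latent at inference
# time. All sizes, the feature index, and the hook point are hypothetical.
import torch
import torch.nn as nn

def clamp_sae_feature(residual, sae_encoder, sae_decoder, feature_idx, value):
    """Replace one SAE latent with a fixed value and return the reconstruction."""
    latents = torch.relu(sae_encoder(residual))  # sparse feature activations
    latents[..., feature_idx] = value            # clamp the "good" feature on
    return sae_decoder(latents)                  # steered residual-stream activation

# Stand-in SAE: a linear encoder/decoder pair over a 512-dim residual stream.
d_model, d_sae = 512, 4096
enc, dec = nn.Linear(d_model, d_sae), nn.Linear(d_sae, d_model)
residual = torch.randn(1, d_model)

with torch.no_grad():
    steered = clamp_sae_feature(residual, enc, dec, feature_idx=123, value=5.0)

# In practice this would run inside a forward hook on some transformer block,
# e.g. (hypothetical): model.blocks[10].register_forward_hook(
#     lambda mod, inp, out: clamp_sae_feature(out, enc, dec, 123, 5.0))
```

The worry in the comment above then translates to: an intervention applied from the outside like this is not part of the AI’s own values, so reflection could route around it.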
Another way to say it is to ask how values evolve in out-of-distribution (OOD) situations.
My general prior, albeit a reasonably weak one, is that the best single way to predict how values evolve is to look at their data sources, as well as what data they have received up to now; the second best way is to look at what their algorithms are, especially for social situations; and most other factors don’t matter nearly as much.