I guess the main issue I have with this argument is that an AI system that is extremely good at prediction is unlikely to contain just one high-level concept corresponding to human values (if it contains such a concept at all). It’s likely to also contain high-level concepts corresponding to what people say about values, or rather several concepts corresponding to what various different groups would say about human values. If your proxy is based on what people say, then these say-based concepts will match it much better, and the probability that at least one of them is the best match is increased by how many of them there are. So I don’t put very high weight on this scenario at all.
This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.
Also, I have another strange idea that might increase the probability of this working.
If you could temporarily remove the proxies based on what people say, that would seem to greatly increase the chance of the proxy hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of “true human values”?
I don’t think it’s likely to work, but thought I’d share anyway.
Thanks!
Is this why you put the probability as “10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values”? Or have you updated your probabilities since writing this post?
Yup, this is basically where that probability came from. It still feels about right.