Also, I have another strange idea that might increase the probability of this working.
If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of “true human values”?
I don’t think it’s likely to work, but thought I’d share anyway.
Also, I have another strange idea that might increase the probability of this working.
If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of “true human values”?
I don’t think it’s likely to work, but thought I’d share anyway.