Chris_Leong comments on Alignment By Default

Chris_Leong 22 Mar 2021 4:50 UTC
LW: 4 AF: 3
Also, I have another strange idea that might increase the probability of this working.
If you could temporarily remove the proxies that are based on what people say, this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of “true human values”?
I don’t think it’s likely to work, but thought I’d share anyway.