I’m really glad you wrote this! I think you address an important distinction here, but there might be a further one worth making: how we measure or tell whether a model is aligned in the first place. There seems to be a growing view that if a model’s output looks like the output we’d expect from an aligned AI, then it’s aligned. I think it’s important to distinguish that from the idea that a model is aligned only if you actually have a strong understanding of what its values are, how it acquired them, etc.