[Question] Is AI alignment a purely functional property?

In some recent discussions I have realized that there is quite a nasty implied disagreement about whether AI alignment is a functional property or not. That is, whether your personal definition of an AI being “aligned” is purely a function of its input/output behavior, irrespective of what kind of crazy things are going on inside to generate that behavior.

So I’d like to ask the community: is the current mainstream take that “alignment” is functional (only the input/output mapping matters), or that the internal computation matters too (for example, it’s not OK to think a naughty thought and then have some subroutine that cancels it)?
