It seems that there are two questions here: what “humanity’s goals” means, and what “alignment with those goals” means. An example of an answer to the former is Yudkowsky’s Coherent Extrapolated Volition (in a nutshell, what we’d do if we knew more and thought faster).
Edit: Alternatively, in place of “humanity’s goals”, this might be asking what “goals” itself means.
Edit: This might be too simple to be original (and thus useful), but can’t you just define “alignment” as the degree to which the two utility functions match? (A rough sketch of what that could look like is below.)
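For concreteness, here is a toy sketch, entirely my own assumption rather than anyone’s actual proposal: if both agents’ preferences over some finite, shared set of outcomes could be written down as utility vectors, one crude “alignment score” is the cosine similarity of the centered vectors, which is invariant to the positive affine rescalings that leave a utility function’s preferences unchanged.

```python
# Toy, hypothetical "alignment score" between two utility functions
# represented as vectors of utilities over the same finite set of outcomes.
import numpy as np

def alignment(u_human: np.ndarray, u_ai: np.ndarray) -> float:
    """Return a score in [-1, 1]: 1 = identical preferences, -1 = opposed."""
    def normalize(u: np.ndarray) -> np.ndarray:
        # Center and rescale so the score ignores positive affine
        # transformations (u -> a*u + b with a > 0), which don't change
        # the preferences a utility function encodes.
        u = u - u.mean()
        norm = np.linalg.norm(u)
        return u / norm if norm > 0 else u
    return float(np.dot(normalize(u_human), normalize(u_ai)))

# Example: three outcomes, similar but not identical preferences.
print(alignment(np.array([0.0, 1.0, 2.0]), np.array([0.1, 0.9, 2.5])))
```

Of course, this bakes in exactly the assumption questioned next: that there is an accessible human utility function over outcomes to compare against in the first place.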
Perhaps this just shifts the problem to “utility function”—it’s not as if humans have an accessible and well-defined utility function in practice.
Would we want to build an AI with a similarly ill-defined utility function, or should we make its utility function more well-defined at the cost of encoding human values less faithfully? Is it even practically possible to build an AI whose values perfectly match our current understanding of our own values, or will any attempted slightly-incoherent goal system differ enough from ours that it would be better to just build a coherent system?