Why can’t we just have a simple process of open dialogue, with values of truth and individual agency during the reflection, and some clearly defined contract at the end of the long reflection to, say, take power away from the AGI drones?
How is that process implemented? How do we give power to that process, which optimizes for things like “truth” and “individual agency”, over processes that optimize just for power, or over processes that Goodhart whatever metric we’re using to decide which process to give power to?
And if we have some way to define “truth” and “individual agency” directly, and tell our strawberry-aligned AI to only give power to such processes — is it really just strawberry-aligned? Why can’t we instead just tell it to build a utopia, if our command of alignment is strong enough to robustly define “truth” and “individual agency” to the AI?
Yep, fair point. In my original comment I seem to have forgotten about the problem of AIs Goodharting our long reflection. I probably agree now that doing a pivotal act into a long reflection is approximately as difficult as solving alignment.
(Side-note about how my brain works: I notice that when I think through all the argumentative steps deliberately, I do believe this statement: “Making an AI which helps humans clarify their values is approximately as hard as making an AI care about any simple, specific thing.” However, it does not come to mind automatically when I’m reasoning about alignment. Two possible fixes:
1. Think more concretely about Retargeting the Search when I think about solving alignment. This makes the problems seem similar in difficulty.
2. Meditate on just how hard it is to target an AI at something. Sometimes I forget how Goodhartable any objective is (see the toy sketch after this note).
)
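A toy model that makes the Goodhart point concrete: score candidates by a proxy = true value + error, and take the argmax of the proxy over n candidates. As n (the optimization pressure) grows, the proxy score of the pick keeps climbing, while its true value improves far more slowly; if the error is heavier-tailed than the true value, the pick’s true value stops improving at all. A minimal sketch in Python (the distributions and sample sizes are arbitrary, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_by_proxy(n_candidates: int, trials: int = 5000):
    """Best-of-n selection on a noisy proxy.

    true_value is what we actually care about; the proxy adds a
    heavy-tailed error term, and we "optimize" by taking the argmax
    of the proxy over n candidates (more candidates = more pressure).
    Returns the mean proxy score and mean true value of the picks.
    """
    true_value = rng.normal(size=(trials, n_candidates))
    error = rng.lognormal(mean=0.0, sigma=1.0, size=(trials, n_candidates))
    proxy = true_value + error
    picked = proxy.argmax(axis=1)
    rows = np.arange(trials)
    return proxy[rows, picked].mean(), true_value[rows, picked].mean()

for n in (10, 100, 1_000, 10_000):
    proxy_score, true_score = pick_by_proxy(n)
    print(f"n={n:>6}  proxy of pick: {proxy_score:6.2f}   true value of pick: {true_score:6.2f}")
```

The lognormal error is what makes this extremal rather than merely regressional Goodhart: any error distribution with heavier tails than the true value gives the same qualitative behavior, with the selection eventually landing almost entirely on error.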
Thanks!