3. Stop worrying about finding “outer objectives” which are safe to maximize.[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function).
Instead, focus on building good cognition within the agent.
In my ontology, there’s only an inner alignment problem: How do we grow good cognition inside of the trained agent?
This vibes well with what I’ve been thinking about recently.
There’s a post in the back of my mind called “Character alignment”, which is about how framing alignment in terms of values, goals, reward, etc. is maybe not always ideal, because, at least introspectively for me, these seem to be strongly influenced by a more general structure of my cognition, i.e. my character.
Here character can be understood as a certain set of specific strategic priors, which might make good optimisation targets because they drop out of game-theoretic considerations, and are therefore plausibly modelled quite generally and robustly by sufficiently advanced agents.