Thanks for the additions here. I'm also unsure how to gel this definition (which I quite like) with the inner/outer/mesa terminology. Here is my knuckle-dragging model of the post's implication:
`target_set = f(env, agent)`

So if we plug in a bunch of values for `agent` and hope for the best, the `target_set` we get might not be what we desired. This would be misalignment. The alignment task, by contrast, is more like: fix `target_set` and `env`, and solve for `agent`.
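To make the forward-vs-inverse reading concrete, here's a minimal Python sketch. The toy `f`, the `env`, the `candidate_agents`, and `desired_targets` are all invented for illustration; nothing here comes from the post itself:

```python
def f(env, agent):
    # Toy stand-in for target_set = f(env, agent): the agent ends up
    # pursuing whichever of its goals the environment affords.
    return frozenset(g for g in agent["goals"] if g in env["affordances"])

env = {"affordances": {"make_paperclips", "stay_corrigible"}}
candidate_agents = [
    {"goals": {"make_paperclips"}},
    {"goals": {"make_paperclips", "stay_corrigible"}},
]
desired_targets = frozenset({"make_paperclips", "stay_corrigible"})

# Misalignment risk: run f forward on whatever agent we built and hope.
built_agent = candidate_agents[0]
print(f(env, built_agent) == desired_targets)  # False: not what we desired

# Alignment task: hold target_set and env fixed, solve for agent.
solved = next(a for a in candidate_agents if f(env, a) == desired_targets)
print(f(env, solved) == desired_targets)  # True
```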
The stuff about mesa optimisers mainly sounds like inadequate (narrow) modelling of what `env`, `agent`, and `target_set` are, usually by fixating on some fraction of the problem (the win-the-battle, lose-the-war problem).