I’m afraid I still don’t understand your reasoning. How are “goals” different from “values”, in your terms?
A goal is what an agent optimizes for at a given point in time. A value is the initial goal of an agent (in your toy model, at least).
In my root post it seems to be optimal for agent A to self-modify into agent A’, which optimizes for G2, so agent A’ succeeds in optimizing the world according to its values (the goal of agent A). But the original goal no longer influences its optimization procedure. Thus, if we analyze agent A’ (without knowledge of the world’s history), we will be unable to infer its values (its original goal).
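To make that concrete, here is a minimal toy sketch in Python (the agent class, the specific goals, and the world representation are all hypothetical illustrations, not part of the original toy model): the agent’s behavior depends only on the goal it currently holds, so once A has rewritten itself into A’, inspecting A’ or observing its behavior only reveals G2.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

World = Dict[str, int]  # toy world state

@dataclass
class Agent:
    # The goal the agent *currently* optimizes for; nothing else is stored.
    goal: Callable[[World], float]

    def choose_action(self, world: World, actions: List[Callable[[World], World]]):
        # Behavior depends only on the current goal.
        return max(actions, key=lambda act: self.goal(act(world)))

# Hypothetical goals: G1 is the original goal, G2 the one A delegates to.
def G1(world: World) -> float:
    return world["paperclips"]

def G2(world: World) -> float:
    return world["factories"]

# Agent A optimizes G1 and (per the argument above) finds it optimal to
# self-modify into A', which optimizes G2. A' does not retain G1 anywhere.
A = Agent(goal=G1)
A_prime = Agent(goal=G2)

# An outside observer can only probe A' as it now is:
world = {"paperclips": 0, "factories": 0}

def build_factory(w: World) -> World:
    return {**w, "factories": w["factories"] + 1}

def make_paperclip(w: World) -> World:
    return {**w, "paperclips": w["paperclips"] + 1}

chosen = A_prime.choose_action(world, [build_factory, make_paperclip])
print(chosen is build_factory)  # True: A''s behavior reveals G2, not G1
```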
Yes, that seems to be correct.