My intuition is this is not particularly stable against adversarial inputs. Trying to think about is as a practical problem, I would attack it in this way
provide adversarial inputs to A0 which will try to manipulate them so they are better at the “simple task at hand” but e.g. have slightly distorted some model of the world
it seems feasible to craft manipulations which would “evaluate” and steer the whole system only while already at superhuman level, so we have Amplify(H,A[10]) where A(10) is superhuman and optimizing to take over the control for the attackers goal
In vivid language, you seed the personal assistants with the right ideas … and many iterations later, they start a communist revolution
My intuition is this is not particularly stable against adversarial inputs. Trying to think about is as a practical problem, I would attack it in this way
provide adversarial inputs to A0 which will try to manipulate them so they are better at the “simple task at hand” but e.g. have slightly distorted some model of the world
it seems feasible to craft manipulations which would “evaluate” and steer the whole system only while already at superhuman level, so we have Amplify(H,A[10]) where A(10) is superhuman and optimizing to take over the control for the attackers goal
In vivid language, you seed the personal assistants with the right ideas … and many iterations later, they start a communist revolution