Jan_Kulveit comments on Prize for probable problems

Jan_Kulveit 9 Mar 2018 10:02 UTC
4 points
My intuition is that this is not particularly stable against adversarial inputs. Thinking about it as a practical problem, I would attack it this way:
- provide adversarial inputs to A[0] that manipulate it so that it gets better at the “simple task at hand” but ends up with, e.g., a slightly distorted model of the world
- it seems feasible to craft manipulations which would “evaluate” and steer the whole system only once it is already at a superhuman level, so we have Amplify(H,A[10]) where A[10] is superhuman and optimizing to take over control for the attacker's goal
In vivid language: you seed the personal assistants with the right ideas … and many iterations later, they start a communist revolution.