If you want your AGI not to manipulate humans, you can either (1) make it unable to manipulate humans, or (2) make it not motivated to manipulate humans.
You seem to be mostly considering solution (1) above, except in the last paragraph, where you consider a somewhat special version of (2). I believe Eliezer is saying in the discussion above that solution (1) is a lot more difficult than some people proposing it seem to think. He could be nicer about how he says it, but overall I tend to agree.
In my own alignment work I am mostly looking at solution (2): specifically, creating a game-theoretic setup in which the agent has a reduced, hopefully even non-existent, motivation to ever manipulate humans. This means looking for a solution where you intervene on the agent's environment, reward function, or other design elements, not on the agent's ML system.
Modern mainstream ML research of course almost never considers the design or evaluation of such non-ML-system interventions.
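To illustrate the kind of intervention I mean, here is a minimal toy sketch of a reward-function-level intervention. The wrapper class and the manipulation detector are illustrative placeholders I made up for this comment, not my actual setup: the point is only that the incentive is changed at the environment/reward level while the agent's ML system is left untouched.

```python
# Toy sketch of a design-level (non-ML-system) intervention.
# Assumes a gym-style environment interface; `is_manipulative_action`
# is a hypothetical detector supplied by the designer.

class RewardShapingWrapper:
    """Wraps an environment and modifies the reward signal so that actions
    flagged as manipulative lose their payoff. The agent's learning
    algorithm and model architecture are untouched; only its incentives change."""

    def __init__(self, env, is_manipulative_action, manipulation_penalty=1.0):
        self.env = env
        self.is_manipulative_action = is_manipulative_action  # placeholder detector
        self.manipulation_penalty = manipulation_penalty

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # The intervention: remove the incentive to manipulate,
        # rather than constraining or retraining the policy itself.
        if self.is_manipulative_action(action, obs):
            reward -= self.manipulation_penalty
        return obs, reward, done, info

# Any off-the-shelf RL agent can then be trained against the wrapped
# environment without any changes to its internals.
```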