Note 2 seems worth discussing further. The key step, as I see it, is that the AI is not a communication consequentialist: it does not model the effects of its advice on the world. I would suggest calling this “Ivory Tower AI” or maybe “Ivory Box”.
To sketch one way this might work, queries could take the form “What could Agent(s) X do to achieve Y?” and the AI then reasons as if it had magic control over the mental states of X, formulates a plan, and expresses it according to predefined rules. Both the magic control and the expression rules are non-trivial problems, but I don’t see any reason they’d be Friendliness-level difficult.
(just never ever let Agent X be “a Tool AI” in your query)
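To make the shape of this concrete, here is a minimal sketch in Python of the query loop I have in mind. Everything here is purely illustrative and hypothetical (the names `IvoryBoxQuery`, `plan_as_if_controlling`, `expression_rules` are my own inventions, not a proposed implementation); the point is only where the boundary sits: the planner never reasons about what happens once its answer is read.

```python
# A minimal, hypothetical sketch of the "Ivory Box" query loop described above.
# None of these names refer to a real system; they only illustrate the boundary
# between planning-as-if-controlling-X and fixed, non-optimized expression.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    """A sequence of actions the AI would 'magically' have Agent X perform."""
    steps: List[str]


@dataclass
class IvoryBoxQuery:
    agent: str  # "Agent(s) X" -- must never itself be a Tool AI
    goal: str   # "Y", the outcome to achieve


def answer(query: IvoryBoxQuery,
           plan_as_if_controlling: Callable[[str, str], Plan],
           expression_rules: Callable[[Plan], str]) -> str:
    """Answer a query without modeling the effect of the answer itself.

    1. Reason as if we had magic control over the agent's mental state:
       the planner only asks "what would work if X simply did it?",
       never "what will X actually do once it reads this advice?".
    2. Render the plan through predefined expression rules, which are
       fixed in advance rather than optimized for persuasiveness.
    """
    plan = plan_as_if_controlling(query.agent, query.goal)
    return expression_rules(plan)
```

The design choice doing the work is that `expression_rules` is a fixed function chosen beforehand, so there is no optimization pressure on how the advice lands, which is exactly what makes the AI not a communication consequentialist.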