it’s conceivable that an AI can pull off a treacherous turn on its first try [...if...] an AI trained by some non-imitation-learning self-supervised method (e.g. self-play)
It depends on the type of self-play. If the self-play is entirely between AIs, with no other human-like parties in the environment, I agree, because such self-play AIs could learn to cooperate and scheme very powerfully. But if the environment contains (simulations of) human(-like) agents, including intentionally weak ones, and the evaluation scores factors like care for and collaboration with them, then it might look different.
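To make that concrete, here is a minimal sketch of what such an evaluation term could look like. Everything here is hypothetical (the `AgentOutcome` fields, `shaped_reward`, the `care_weight` mixing); it is just one way to score care and collaboration toward weak simulated agents alongside raw task performance:

```python
from dataclasses import dataclass

@dataclass
class AgentOutcome:
    """Per-episode outcome for one simulated agent in the environment (hypothetical)."""
    is_human_like: bool   # simulated human(-like) agent vs. another AI copy
    capability: float     # 0.0 = intentionally weak, 1.0 = strong
    welfare: float        # how well this agent fared over the episode (0..1)
    was_helped: float     # how much the self-play AI assisted it (0..1)

def shaped_reward(task_score: float,
                  others: list[AgentOutcome],
                  care_weight: float = 0.5) -> float:
    """Toy reward: raw task performance plus explicit credit for care
    and collaboration toward human-like agents, weighted toward the
    intentionally weak ones."""
    humans = [a for a in others if a.is_human_like]
    if not humans:
        return task_score  # pure AI-vs-AI self-play: no care term at all
    care = 0.0
    for a in humans:
        # Weak agents count more: neglecting them costs the most reward.
        weakness_bonus = 1.0 + (1.0 - a.capability)
        care += weakness_bonus * (a.welfare + a.was_helped) / 2.0
    care /= len(humans)
    return (1.0 - care_weight) * task_score + care_weight * care

# Example: with no human-like agents the reward is just the task score;
# with a weak, well-treated simulated human, the care term also contributes.
print(shaped_reward(0.9, []))
print(shaped_reward(0.9, [AgentOutcome(True, 0.2, 0.3, 0.8)]))
```

Note that with an empty `others` list (pure AI-vs-AI self-play) the care term vanishes entirely, which is exactly the case where the quoted concern applies.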
I think you’d actually need some human-like entities present in order for the AI to learn to deceive humans specifically.
Yes, but it makes a difference whether your environment is composed entirely of agents of the same kind (the self-playing AI) or contains a variety of simulated agents acting in complex social structures, with the self-play AI's behavior within those structures being scored.
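As a rough illustration of that distinction, here is a hypothetical sketch (the function name, agent kinds, and social roles are all made up) contrasting the two environment populations:

```python
import random

def build_population(homogeneous: bool, n: int = 8) -> list[dict]:
    """Two environment setups from the discussion above (all labels made up):
    - homogeneous: every agent is another copy of the self-playing AI
    - heterogeneous: a mix of simulated agents embedded in social roles,
      whose interactions with the self-play AI are what gets scored."""
    if homogeneous:
        return [{"kind": "self_play_copy", "role": "peer"} for _ in range(n)]
    roles = ["leader", "dependent", "trader", "bystander"]
    kinds = ["human_sim", "weak_human_sim", "self_play_copy"]
    return [{"kind": random.choice(kinds), "role": random.choice(roles)}
            for _ in range(n)]
```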