“You’re correct—none of the studies cited have used a strict GAN-like architecture with a generator and discriminator trained simultaneously in lockstep to prevent AI deception.
Most existing AI safety research, such as Redwood Research or Anthropic’s work, employs sequential or iterative adversarial methods: typically, a model generates potentially deceptive outputs, these outputs are collected, and separately a discriminator or classifier is trained to catch deception. These processes are usually iterative but asynchronous, rather than the simultaneous, competitive co-training seen in GAN architectures.
Your original concept—training generator (task-performing AI) and discriminator (deception detector) networks simultaneously in a true GAN-style loop—does not appear to have been explicitly explored in the AI safety literature so far.”
ChatGPT Deep Research produced this:
https://chatgpt.com/share/67d62105-7c6c-8002-8bbb-74982455839b
Apparently nobody has done this?
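For concreteness, here is a minimal pure-Python sketch of what "simultaneous, lockstep co-training" means, as opposed to the collect-outputs-then-train-a-classifier loop the quote attributes to existing work. This is a toy 1-D GAN, not anything from the cited studies: the "honest" data distribution, the linear generator/discriminator, and all hyperparameters are illustrative assumptions. The key structural point is that the discriminator and generator are each updated within the same training step, against each other's current parameters.

```python
import math
import random

random.seed(0)

def sigmoid(s):
    # Clamped for numerical safety.
    return 1.0 / (1.0 + math.exp(-max(-60.0, min(60.0, s))))

# Toy stand-in for "honest" outputs: real samples cluster around 4.0.
REAL_MEAN, REAL_STD = 4.0, 0.5

# Generator g(z) = a*z + b (the "task model"); discriminator
# D(x) = sigmoid(w*x + v) (the "deception detector"). All hypothetical.
a, b = 1.0, 0.0
w, v = 0.0, 0.0
lr, batch, steps = 0.05, 32, 2000

for _ in range(steps):
    reals = [random.gauss(REAL_MEAN, REAL_STD) for _ in range(batch)]
    zs = [random.gauss(0.0, 1.0) for _ in range(batch)]
    fakes = [a * z + b for z in zs]

    # Discriminator update: same training step, not a separate offline phase.
    gw = gv = 0.0
    for x in reals:                      # push D(real) toward 1
        d = sigmoid(w * x + v)
        gw -= (1 - d) * x
        gv -= (1 - d)
    for x in fakes:                      # push D(fake) toward 0
        d = sigmoid(w * x + v)
        gw += d * x
        gv += d
    w -= lr * gw / (2 * batch)
    v -= lr * gv / (2 * batch)

    # Generator update against the *current* discriminator, in lockstep.
    ga = gb = 0.0
    for z, x in zip(zs, fakes):          # push D(fake) toward 1
        d = sigmoid(w * x + v)
        ga -= (1 - d) * w * z
        gb -= (1 - d) * w
    a -= lr * ga / batch
    b -= lr * gb / batch

print(round(b, 2))  # generator offset, driven toward REAL_MEAN by the loop
```

The contrast with the sequential methods described above is in the loop body: there is no "freeze generator, harvest a dataset of deceptive outputs, train the classifier offline" stage. Each party adapts every step to the other's latest parameters, which is what makes the training genuinely adversarial in the GAN sense.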