there are clearly some training setups that seem more dangerous than other training setups . . . .
Like, as an example, my guess is that systems where a substantial chunk of the compute was spent on training with reinforcement learning in environments that reward long-term planning and agentic resource acquisition (e.g. many video games, Diplomacy, or various simulations with long-term objectives) sure seem more dangerous.
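To make that kind of setup concrete, here is a minimal, purely illustrative sketch (plain Python; every name, environment, and constant in it is an assumption of mine, not anything from this discussion): a REINFORCE loop on a toy environment whose only reward comes at the end of a long episode and grows with the resources the agent has accumulated, so the training signal directly selects for long-horizon planning and resource acquisition.

```python
import math
import random

# Hypothetical toy setup: every name here (ResourceEnv, "gather"/"invest"/"idle",
# the reward constants) is illustrative. The point is only to show the shape of
# the flagged setup: RL in an environment whose reward arrives at the end of a
# long episode and scales with the resources the agent has accumulated.
class ResourceEnv:
    ACTIONS = ["gather", "invest", "idle"]

    def __init__(self, horizon=50):
        self.horizon = horizon

    def run_episode(self, policy):
        resources, invested = 1.0, 0.0
        actions = []
        for _ in range(self.horizon):
            a = policy.sample()
            actions.append(a)
            if a == "gather":
                resources += 1.0                 # immediate, linear gain
            elif a == "invest":
                invested += 0.1 * resources      # pays off only at episode end
        # Sparse terminal reward: long-horizon credit assignment is required
        # to learn that gathering and investing beat idling.
        return resources + 5.0 * invested, actions


class SoftmaxPolicy:
    """Stateless softmax policy: one logit per action."""

    def __init__(self):
        self.logits = {a: 0.0 for a in ResourceEnv.ACTIONS}

    def probs(self):
        exps = {a: math.exp(l) for a, l in self.logits.items()}
        z = sum(exps.values())
        return {a: e / z for a, e in exps.items()}

    def sample(self):
        r, cum = random.random(), 0.0
        for a, p in self.probs().items():
            cum += p
            if r <= cum:
                return a
        return ResourceEnv.ACTIONS[-1]


def train(episodes=2000, lr=1e-3):
    env, policy = ResourceEnv(), SoftmaxPolicy()
    baseline = 0.0
    for _ in range(episodes):
        ret, actions = env.run_episode(policy)
        baseline = 0.95 * baseline + 0.05 * ret    # running-mean baseline
        advantage = ret - baseline
        probs = policy.probs()                     # policy was fixed during the episode
        grads = {b: 0.0 for b in ResourceEnv.ACTIONS}
        for a in actions:                          # d log pi(a) / d logit_b = 1[b == a] - p_b
            for b in ResourceEnv.ACTIONS:
                grads[b] += (1.0 if b == a else 0.0) - probs[b]
        for b in ResourceEnv.ACTIONS:              # REINFORCE step, gradient averaged over the episode
            policy.logits[b] += lr * advantage * grads[b] / len(actions)
    return policy


if __name__ == "__main__":
    random.seed(0)
    print(train().probs())   # expect "idle" to be driven toward low probability
```

The load-bearing design choice in the sketch is the sparse terminal reward: the optimizer only gets signal for whole-episode strategies, so it selects directly for the long-term, resource-accumulating behaviour flagged above, rather than for any step-by-step behaviour.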
Any recommended reading on which training setups are safer? If none exists, someone should really write this up.