How the power-seeking theorems relate to the selection theorem agenda.
Power-seeking theorems. P(agent behavior | agent decision-making procedure, agent objective, other agent internals, environment).
I’ve mostly studied the likelihood function for power-seeking behavior: what decision-making procedures, objectives, and environments produce what behavioral tendencies. I’ve discovered some gears for what situations cause what kinds of behaviors.
The power-seeking theorems also allow some discussion of P(agent behavior | agent training process, training parameters, environment), but it’s harder to reason about eventual agent behavior when we have fewer gears for what kinds of agent cognition get trained.
Selection theorems. P(agent decision-making procedure, agent objective, other internals | training process, environment). What kinds of cognition will be trained in what kinds of situations? This gives mechanistic pictures of how cognition will work, with consequences for interpretability work, for alignment agendas, and for forecasting.
If we understood both of these, as a bonus we would be much better able to predict P(power-seeking | environment, training process) via P(power-seeking | agent internals) P(agent internals | environment, training process).[1]
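As a sketch of that factorization (using I for agent internals, E for environment, and T for training process, shorthand introduced here rather than notation from the theorems), marginalize over internals and apply the screening-off condition from footnote [1]:

$$
\begin{aligned}
P(\text{power-seeking} \mid E, T) &= \sum_{I} P(\text{power-seeking} \mid I, E, T)\, P(I \mid E, T) \\
&= \sum_{I} P(\text{power-seeking} \mid I)\, P(I \mid E, T),
\end{aligned}
$$

where the second step uses $P(\text{power-seeking} \mid I, E, T) = P(\text{power-seeking} \mid I)$, i.e. internals screening off the environment and training process.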
[1] For power-seeking, agent internals screens off the environment and training process.