Let’s say: if you train a coherently goal-directed, situationally aware, somewhat-better-than-human-level model using baseline forms of self-supervised pre-training + RLHF on diverse, long-horizon, real-world tasks, my subjective probability is ~25% that this model will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later.
Have you tried extending this gut estimate to something like:
If many labs use somewhat different training procedures to train their models, each of which still falls under the umbrella of “coherently goal-directed, situationally aware [...]”, what is the probability that at least one of these models “will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later”?
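For concreteness, here is a minimal sketch of the naive version of that extension. The assumptions are mine, not the original author's: each lab's run is treated as an independent draw with the same ~25% per-model probability, which likely overstates how quickly the aggregate number climbs, since labs' training procedures and data are correlated.

```python
# Naive "at least one lab" extension of a per-model scheming estimate.
# Assumes each lab's model is an independent draw with the same probability p;
# correlated training choices across labs would pull these numbers back down.

def at_least_one(p: float, n: int) -> float:
    """P(at least one of n independent models schemes) = 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

if __name__ == "__main__":
    p = 0.25  # the ~25% per-model gut estimate quoted above
    for n in (1, 3, 5, 10):
        print(f"n = {n:2d} labs -> P(at least one) ≈ {at_least_one(p, n):.2f}")
```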