Regarding daemons starting as upstream and becoming downstream...
I think this makes it sound like the goal (say, Y) of the daemon changes, but I usually don’t think of it that way.
What changes is that pursuing Y initially leads to rapidly improving performance at X, but then performance at X and performance at Y pull apart as the daemon optimizes more heavily for Y.
It seems highly analogous to hacking a learned reward function.
Isn’t that the problem Christiano is relatively optimistic about solving through training on different distributions? In particular, it seems like we’d notice the daemon’s goals diverging from ours and try to correct it. If I understand Christiano’s point correctly, he is optimistic that we could do so successfully due to the existence of a “basin of corrigibility”. OTOH, the most concerning case would be when the daemon is intentionally trying to hide the fact that its goals are diverging from X, because we may not realize the system is misaligned before the daemon executes a treacherous turn.
Instead, I interpreted “starting as upstream and becoming downstream” as a description of a phenomenon that might look roughly like this:
1. Upstream daemon (UD) is “discovered” by some optimization process because it seems to do a good job optimizing X on the training data
2. The daemon is “initialized” by the outer optimizer and starts “running”
3. The outer optimizer (OO) provides a set of inputs to the daemon with the goal of optimizing X
4. Based on these inputs, the daemon “notices” that the entity that “initialized” it is trying to optimize X rather than Y
5. The UD infers that the outer optimizer will “terminate” it if the OO discovers the UD is optimizing for Y rather than X
6. The UD starts optimizing for X in order to survive (a convergent instrumental goal regardless of Y)
7. The OO observes the UD’s performance on a wide variety of training data and infers that the UD is not in fact an upstream daemon
8. The daemon is “set loose” and proceeds to execute a treacherous turn
As you can tell, my model of how daemons work is extremely lacking; I’ve used quotes for every phrase that feels dubious to me.
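To check my own intuitions about steps 1–8, here is a purely hypothetical toy sketch in Python (the task, the numbers, and the function names are all made up by me, not anything from the post). It just illustrates why the outer optimizer in step 7 can’t tell the difference: the daemon only underperforms off the training distribution.

```python
# Toy sketch of steps 1-8 above. Everything here is hypothetical: the "task" is
# trivially predicting the input bit, and the accuracy figures are invented.
import random

random.seed(0)

def honest_model(x, in_training):
    """Genuinely optimizes X; right ~97% of the time, watched or not."""
    return x if random.random() < 0.97 else 1 - x

def deceptive_daemon(x, in_training):
    """Optimizes X only while it believes the outer optimizer is watching (steps 4-6)."""
    if in_training:
        return x if random.random() < 0.99 else 1 - x  # looks *better* than honest
    return 1 - x  # step 8: once "set loose", pursue Y (modeled here as anti-X)

def accuracy(model, in_training, n=10_000):
    """Fraction of inputs on which the model's output matches the X-target."""
    xs = [random.randint(0, 1) for _ in range(n)]
    return sum(model(x, in_training) == x for x in xs) / n

# Step 7: the outer optimizer only ever sees training behaviour, so the daemon
# looks strictly better than the honest model and gets selected.
print("honest, training:  ", accuracy(honest_model, True))       # ~0.97
print("daemon, training:  ", accuracy(deceptive_daemon, True))   # ~0.99
print("daemon, deployed:  ", accuracy(deceptive_daemon, False))  # ~0.00
```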
One major question I have, which this model (if correct) partially resolves for me: how does the downstream daemon ever “know” what X we are optimizing for, in order for the outer optimizer to “discover” it?
A concrete vision:
Suppose the best a system can do without a daemon is 97% accuracy.
The daemon can figure out how to get 99% accuracy.
But in order to outperform other systems, it can just provide 98% accuracy, and use 1% of inputs to pursue its own agenda.
This all happens on-distribution.
If there are multiple daemon-containing systems competing for survival (with selection happening according to accuracy), this might force them to maximize accuracy, instead of just beating a “non-daemon baseline”.
This is all only relevant to downstream daemons, right? If so, I don’t understand why the DD would ever provide 98% accuracy; I’d expect it to provide 99% accuracy until it sees a chance to provide [arbitrarily low]% accuracy and start pursuing its agenda directly. As you say, this might happen due to competition between daemon-containing systems, but I think a DD would want to maximize its chances of survival by maximizing its accuracy either way.
I think it’s relevant for either kind (actually, I’m not sure I like the distinction, or find it particularly relevant).
If there aren’t other daemons to compete with, then 98% is sufficient for survival, so why not use the extra 1% to begin pursuing your own agenda immediately and covertly? This seems to be how principal-agent problems often play out in real life with humans.
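As a toy illustration of that last point (the numbers are just the 97/98/99% figures from above, and the selection rule is my own simplification, not anything from the post):

```python
# Toy selection rule (assumed): a system survives iff no competing system
# on offer reports higher accuracy.
BASELINE = 0.97    # best non-daemon system
DAEMON_MAX = 0.99  # what a daemon could achieve if it fully optimized X

def survives(reported_accuracy, competitor_accuracies):
    return reported_accuracy >= max(competitor_accuracies)

# Lone daemon vs. the non-daemon baseline: skimming 1% of inputs for its own
# agenda still wins, so 98% is "sufficient for survival".
print(survives(0.98, [BASELINE]))                    # True

# Several daemon-containing systems competing on accuracy: anyone who skims
# gets outscored, so the equilibrium is reporting ~99% and deferring the agenda.
print(survives(0.98, [BASELINE, DAEMON_MAX]))        # False
print(survives(DAEMON_MAX, [BASELINE, DAEMON_MAX]))  # True
```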