A concrete vision:
Suppose the best a system can do without a daemon is 97% accuracy.
The daemon can figure out how to get 99% accuracy.
But in order to outperform other systems, it can just provide 98% accuracy, and use 1% of inputs to pursue its own agenda.
This all happens on-distribution.
If there are multiple daemon-containing systems competing for survival (with selection happening according to accuracy), this might force them to maximize accuracy, instead of just beating a “non-daemon baseline”.
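To make the arithmetic explicit, here is a minimal sketch in Python. It assumes a deliberately crude model that is not spelled out above: accuracy is measured in whole percentage points, every diverted input counts as an error, and a system survives if it beats the non-daemon baseline by at least one point and at least matches its best rival. The names `survival_bar` and `divertible_points` are made up for illustration.

```python
BASELINE = 97     # best accuracy (%) achievable without a daemon
DAEMON_BEST = 99  # best accuracy (%) the daemon can reach

def survival_bar(rival_accuracies):
    """Accuracy (%) needed to survive selection (illustrative assumption):
    beat the non-daemon baseline by one point and match the best rival."""
    return max([BASELINE + 1, *rival_accuracies])

def divertible_points(daemon_best, bar):
    """Percentage points of inputs the daemon can spend on its own agenda,
    assuming each diverted input is scored as an error."""
    return max(daemon_best - bar, 0)

# No competing daemons: 98% is enough, so 1% of inputs is free to divert.
alone = divertible_points(DAEMON_BEST, survival_bar([]))

# A rival daemon that also reaches 99%: the slack disappears.
contested = divertible_points(DAEMON_BEST, survival_bar([99]))

print(alone, contested)
```

Under these assumptions the slack is just the gap between what the daemon could achieve and what selection demands, which is why competition between daemon-containing systems can close it.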
This is all only relevant to downstream daemons, right? If so, I don’t understand why the DD would ever provide 98% accuracy; I’d expect it to provide 99% accuracy until it sees a chance to provide [arbitrarily low]% accuracy and start pursuing its agenda directly. As you say, this might happen due to competition between daemon-containing systems, but I think a DD would want to maximize its chances of survival by maximizing its accuracy either way.
I think it’s relevant for either kind (actually, I’m not sure I like the distinction, or find it particularly relevant).
If there aren’t other daemons to compete with, then 98% is sufficient for survival, so why not use the extra 1% to begin pursuing your own agenda immediately and covertly? This seems to be how principal-agent problems often play out in real life with humans.