First of all, thank you so much for this post! I found it generally very convincing, but there were a few things that felt missing, and I was wondering if you could expand on them.
> However, I expect that neither mechanism will produce as much of a relative jump in AI capabilities, as cultural development produced in humans. Neither mechanism would suddenly unleash an optimizer multiple orders of magnitude faster than anything that came before, as was the case when humans transitioned from biological evolution to cultural development.
Why do you expect this? Surely the difference between passive and active learning, or the ability to view and manipulate one’s own source code (or that of a successor), would be pretty enormous? Also, it feels like this implicitly assumes that relatively dumb algorithms like SGD or predictive processing / Hebbian learning will not be improved upon during such a feedback loop.
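(To make the "relatively dumb" point concrete, here is a toy sketch, entirely my own and not from the post, of the two kinds of update rule I have in mind. Both are a single fixed local formula applied over and over; a recursive-improvement loop could in principle swap either formula out for something better.)

```python
# Toy illustration (my own construction): an SGD step and a Hebbian step,
# side by side. Each is just a fixed formula applied repeatedly.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                  # input activity
W = rng.normal(size=(4, 8)) * 0.1       # weights
y_target = rng.normal(size=4)           # supervision signal (SGD case only)
lr = 0.01

# SGD on a squared error: follow the gradient of an explicit loss.
y = W @ x
grad = np.outer(y - y_target, x)        # dL/dW for L = 0.5 * ||W x - y_target||^2
W_sgd = W - lr * grad

# Hebbian rule: no loss at all, just strengthen weights in proportion to
# correlated pre- and post-synaptic activity ("fire together, wire together").
post = W @ x
W_hebb = W + lr * np.outer(post, x)
```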
On the topic of alignment, it feels like many of the techniques you mention are not at all good candidates, because they focus on correcting bad behavior as it appears. It seems like we mainly have a problem if powerful superhuman capabilities arrive before we have robustly aligned a system to good values. Currently, none of those methods have (as far as I can tell) any chance of scaling up, in particular because at some point we won’t be able to apply corrective pressures to a model that has decided to deceive us. Do we have any examples of a system where we apply corrective pressure early to instill some values, and then scale up performance without needing to keep applying corrective pressure?
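(As a very rough illustration of the kind of experiment I'm asking about, here is a toy sketch, my own construction with made-up numbers: instill a "value" via a corrective penalty early in training, drop the penalty, keep optimizing for performance, and check whether the constraint survives.)

```python
# Toy version of the question above (not from the post): does an early
# corrective pressure persist once we keep optimizing purely for performance?
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 5
X = rng.normal(size=(n, d))
# The label depends on every feature, including feature 0, which we treat
# as a "disallowed" signal that the corrective phase tries to suppress.
true_w = np.array([2.0, 1.0, -1.0, 0.5, 0.5])
y = (X @ true_w + 0.5 * rng.normal(size=n) > 0).astype(float)

def grad_logistic(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

w = np.zeros(d)
lr, penalty = 0.5, 10.0

# Phase 1: train WITH corrective pressure (penalty pushing w[0] toward 0).
for _ in range(500):
    g = grad_logistic(w, X, y)
    g[0] += penalty * w[0]              # the "corrective" term
    w -= lr * g
print("after corrective phase, w[0] =", round(w[0], 3))

# Phase 2: keep optimizing purely for accuracy, corrective term removed.
for _ in range(2000):
    w -= lr * grad_logistic(w, X, y)
print("after further scaling, w[0] =", round(w[0], 3))
# In this toy, w[0] typically drifts back toward its predictive value:
# the early correction does not survive continued optimization.
```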