On the other hand, it could be considered bad news that IDA/Debate/etc. haven’t been deployed yet, or even that RLHF is (at least apparently) working as well as it is. To quote a 2017 post by Paul Christiano (later reposted in 2018 and 2019):
As in the previous sections, it’s easy to be too optimistic about exactly when a non-scalable alignment scheme will break down. It’s much easier to keep ourselves honest if we actually hold ourselves to producing scalable systems.
It seems that AI labs are not yet actually holding themselves to producing scalable systems, and it may well be better if RLHF broke down in some obvious way before we reach potentially dangerous capabilities, to force them to do that.
(I’ve pointed Paul to this thread to get his own take, but haven’t gotten a response yet.)
ETA: I should also note that there is a lot of debate about whether IDA and Debate are actually scalable or not, so some could consider even deployment of IDA or Debate (or these techniques appearing to work well) to be bad news. I’ve tended to argue on the “they are too risky” side in the past, but am conflicted because maybe they are just the best that we can realistically hope for and at least an improvement over RLHF?
I think these methods are pretty clearly not indefinitely scalable, but they might be pretty scalable. E.g., perhaps scalable to AI that is somewhat smarter than human level. See the ELK report for more discussion on why these methods aren’t indefinitely scalable.
A while ago, I think Paul put maybe 50% on IDA being literally indefinitely scalable with simple-ish tweaks. (I’m not aware of an online source for this, but I’m pretty confident this or something similar is true.) IMO, this seems very predictably wrong.
TBC, I don’t think we should necessarily care very much about whether a method is indefinitely scalable.
Sometimes people do seem to think that debate or IDA could be indefinitely scalable, but this just seems pretty wrong to me (what is your debate about AlphaFold going to look like?).
I think the first presentation of the argument that IDA/Debate aren’t indefinitely scalable was in Inaccessible Information, fwiw.