Returning to this essay, it continues to be my favorite Paul post (even What Failure Looks Like only comes second), and I think it's a better way to engage with Paul's work than anything else (including the Eliciting Latent Knowledge document, which feels less grounded in the x-risk problem, is less in Paul's native language, and goes into detail on just one idea at 10x the length, thus communicating less of the big-picture research goal). I feel I can understand all the arguments made in this post. I think it should be mandatory reading before reading Eliciting Latent Knowledge.
Overview of why:
The motivation behind most of the proposals Paul has spent a lot of time on (iterated amplification, imitative generalization) is explained clearly and succinctly.
For a quick summary, this involves:
A proposal for useful ML systems designed with human feedback
An argument that the human-feedback ML systems will have flaws that kill you
A proposal for using ML assistants to debug the original ML system
An argument that the ML assistants will not be able to understand the original human-feedback ML systems
A proposal for training the human-feedback ML systems in a way that requires understandability
An argument that this proposal is uncompetitive
??? (I believe the proposals in the ELK document are the next step here)
A key problem when evaluating very high-IQ, impressive, technical work is that it is unclear which parts you do not understand because you do not understand an abstract technical concept, and which parts are simply judgment calls by the originator of the idea. This post shows very clearly which is which: many of the examples and discussions are technical, but the standards for “plausible failure story”, “sufficiently precise algorithm”, and “sufficiently doomed” are all judgment calls, as are the proposed solutions. I’m not even sure I get on the bus at step 1, that the right next step is to consider ML assistants; it seems plausible this is already begging the question of how to have aligned assistants, and maybe the whole system will recurse later, but that’s not something I’d personally ever gamble on. I still find the ensuing line of attack interesting.
The post has a Q&A addressing natural and common questions, which is super helpful.
As with a number of other posts I have reviewed and given a +9, I just realized that I already wrote a curation notice for this one back when it came out. I should check how predictive my curation notices are (and their false-positive rate); it would be interesting if it turns out I knew most of my favorite posts the moment they came out.