This post deserves a strong upvote. Since you’ve done the review, would you mind answering a reference request? What papers/blog posts represent Paul’s current views on corrigibility?
Thanks for the comment and I’m glad you like the post :)
On the other topic: I’m sorry, I’m afraid I can’t be very helpful here. I’d be somewhat surprised if I had had a good answer to this a year ago, and I certainly don’t have one now.
Some cop-out answers:
To find out whether his thinking on corrigibility had changed, I often found his comments and remarks (and his discussions with others) on posts focused on something else more useful than his blog posts that were explicitly about corrigibility
You might have some luck reading through some of his newer blog posts and seeing if you can spot some mentions there
In case this was about “his current views” as opposed to “the views I tried to represent here, which are one year old”: the comments he left are from this summer, so you can get some idea from there, and maybe assume that he endorses the parts I wrote that he didn’t comment on (at least in the first third of the doc or so, where he still left comments)
FWIW, I just looked through my docs and found a “resources” doc with the following links under corrigibility:
Clarifying AI alignment
Can corrigibility be learned safely?
Problems with amplification/distillation
The limits of corrigibility
Addressing three problems with counterfactual corrigibility
Corrigibility
I’m not vouching for any of those being up to date or the most relevant ones. I’m pretty sure I made this list early on in the process, and it probably doesn’t represent what I considered the latest Paul-view.
Fair enough. Thanks for the recommendations. :)
Cross-linking to another thread: I just posted a long comment with more references to corrigibility resources on your post asking about corrigibility reading lists.
In that comment I focus on corrigibility-related work that has appeared as scientific papers and/or arXiv preprints.