Thank you for this sequence, which has a very interesting perspective and lots of useful info.
Just a quick note on the following section from your overview of “Honest AI” in this post:
What Researchers Are Doing Now
They are demonstrating that models can lie, and they are capturing true and false clusters inside models (this paper is forthcoming).
I was surprised not to see any mention of Eliciting Latent Knowledge (ELK) here. I guess part of it is about “demonstrating that models can lie”, but there is also all the solution-seeking happening by ARC and those who submitted proposals for the ELK prize.
You’re covering a lot of problem areas in this post, so I don’t expect it to be comprehensive about every single one. Just curious if there’s any particular reason you chose not to include ELK here.
Hi Evan! This post is focused on empirical ML approaches, as is PAIS overall. Certainly there is theoretical work that touches on many of the areas, but covering it all isn’t within the scope of this post (it’s already pretty long).
The forthcoming paper mentioned (from Burns, Ye, Klein, and Steinhardt) could be viewed as an empirical attempt to do something similar to ELK.