What kind of information would you look out for that would make you change your mind about alignment-by-default?
What information would cause you to reverse your view again?
What information would cause you to adjust 50% down? 25%?
I know that most probability mass is some measure of gut feel, and I don’t want to introduce nickel-and-diming; I’d just like to get a feel for what information you’re looking for.
A major way I could become more pessimistic is if the “deceptive alignment is 1% likely” post turned out to be wrong in several respects. If that happened, I’d probably revise my beliefs back to where they were originally, at 30-60%.
Another way I could be wrong is if evidence came out that there are more ways for models to be deceptive than the standard scenario of deceptive alignment.
Finally, if the natural abstractions hypothesis didn’t hold, or if goodharting was empirically shown to get worse with model capabilities, then I’d update towards quite a bit more pessimism, on the order of at least 20-30% less confidence than I currently hold.
Thanks! Though, hm.
Now I’m noodling how one would measure goodharting.