A major way I could become more pessimistic is if the "deceptive alignment is 1% likely" post turned out to be wrong in several respects; if that happened, I'd probably revise my beliefs back to where they were originally, at 30-60%.
Another way I could be wrong is if evidence came out that there are more ways for models to be deceptive than the standard scenario of deceptive alignment.
Finally, if the natural abstractions hypothesis didn't hold, or if goodharting were empirically shown to get worse as model capabilities increase, then I'd update toward quite a bit more pessimism, on the order of at least 20-30% less confidence than I currently hold.
Thanks! Though, hm.
Now I’m noodling how one would measure goodharting.
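One rough way to operationalize it (a minimal toy sketch of my own, not something from the posts above): score candidates with an imperfect proxy, select the proxy-best of n, and watch how the selected candidates' true reward diverges from their proxy score as n grows. The Gaussian "true quality" and heavy-tailed proxy error here are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_scores(n, num_trials=5000):
    """Pick the proxy-best of n candidates; return (mean proxy score, mean true reward) of the picks."""
    proxy_picks, true_picks = [], []
    for _ in range(num_trials):
        true_quality = rng.normal(0.0, 1.0, size=n)   # what we actually care about
        proxy_error = rng.standard_t(df=2, size=n)     # heavy-tailed scoring error (toy assumption)
        proxy_score = true_quality + proxy_error        # what the optimizer sees and selects on
        i = int(np.argmax(proxy_score))
        proxy_picks.append(proxy_score[i])
        true_picks.append(true_quality[i])
    return float(np.mean(proxy_picks)), float(np.mean(true_picks))

# More samples = more optimization pressure against the proxy.
for n in (1, 4, 16, 64, 256, 1024):
    proxy_mean, true_mean = best_of_n_scores(n)
    print(f"best-of-{n:<4d}  proxy {proxy_mean:+.2f}   true {true_mean:+.2f}")
```

In this setup the proxy score of the selected candidate keeps climbing with n, while its true reward plateaus (and with heavier-tailed error can fall back), which is one measurable signature of goodharting; the optimization pressure n stands in for whatever role increasing capabilities would play.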