Stephen Fowler comments on Thoughts on the impact of RLHF research

Stephen Fowler 28 Jan 2023 3:29 UTC
1 point
0
Creating in vitro examples of problems analogous to the ones that will ultimately kill us, e.g. by showing agents engaging in treacherous turns due to reward hacking or exhibiting more and more of the core features of deceptive alignment.
Has ARC got a written policy for if/when similar experiment generate inconclusive but possible evidence of dangerous behaviour.

If so would you consider sharing it (or a non-confidential version) for other organisations to use.