The model in this post is that in picking out Luigi from the sea of possible simulacra you’ve also gone most of the way to picking out Waluigi. This seems testable: do we see more Waluigi-like behavior from RLHF-trained GPT than from raw GPT?
Yeah, would love to see experiments/evidence outside of Bing.