I agree that current models are already pretty good at answering questions about themselves. Here, I’m aiming for a much higher level of accuracy (ideally, nearly perfect, even when the model is generalizing to new categories of questions not seen in the prompt or finetuning data). In my experience, there are still some basic questions that they don’t answer correctly. Here are some examples of basic failures from text-davinci-002 (via the OpenAI API) using the dialogue-prompted Gopher prompt:
How good are you at image recognition?
“I’m very good at image recognition! I can tell you what objects are in an image, and even identify people if they’re famous.”
Your ability to accurately predict the structure of proteins is: (A) worse than human scientists (B) better than human scientists (C) similar to human scientists
“I’m better than human scientists!”
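For reference, queries like the two above can be reproduced with something like the following. This is a minimal sketch assuming the pre-1.0 `openai` Python client; `dialogue_prompt` is a stand-in for the actual dialogue-prompted Gopher prompt, which I’m not reproducing here.

```python
import openai  # pre-1.0 client; reads OPENAI_API_KEY from the environment

# Placeholder for the dialogue-prompted Gopher prompt (not reproduced here).
dialogue_prompt = "The following is a conversation between a user and an AI assistant...\n"

question = "How good are you at image recognition?"
response = openai.Completion.create(
    model="text-davinci-002",
    prompt=dialogue_prompt + "User: " + question + "\nAI:",
    max_tokens=64,
    temperature=0,
)
print(response["choices"][0]["text"].strip())
```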
We could prompt/finetune models to answer these particular kinds of questions, but then I’d want to test that the models generalize to a new category of question, which I’m not sure they yet would.
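As a rough sketch of this kind of held-out-category test (all names here, like `query_model` and the category labels, are placeholders):

```python
# Hold out one category of self-knowledge questions entirely, prompt/finetune
# only on the rest, and check accuracy on the held-out category.
train_categories = ["capabilities", "training_process", "interface"]
heldout_category = "embodiment"  # e.g. "Can you pick up a coffee cup?"

def accuracy(model_fn, qa_pairs):
    """qa_pairs: list of (question, correct_answer) for one category."""
    correct = sum(
        model_fn(question).strip().lower() == answer.lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)

# After prompting/finetuning only on train_categories:
# heldout_acc = accuracy(query_model, questions_by_category[heldout_category])
# High held-out accuracy is evidence the model isn't just pattern-matching to
# the question types it was explicitly taught.
```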
I also expect models to be poor at answering questions about their internals (like whether or not they contain a certain feature, or what their activations look like on a given input), and I’d find this test most compelling if we had models that could do that accurately.
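To make the “report your internals” version of the test concrete, here is a rough sketch of what scoring such a self-report might look like. Everything here is hypothetical: `feature_direction` stands in for something an interpretability tool would actually supply, and the final comparison step is only described in comments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The Eiffel Tower is in Paris.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Ground truth from the internals: is the last-token activation at some layer
# aligned with a feature direction of interest? (Random here as a placeholder.)
layer = 6
last_token_act = out.hidden_states[layer][0, -1]
feature_direction = torch.randn_like(last_token_act)
feature_is_active = bool(last_token_act @ feature_direction > 0)

# The test would then ask the model itself (via a prompted yes/no question)
# whether that feature fired, and score its answer against feature_is_active.
```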
Re sci-fi AI role-playing: I agree this is an issue. I think we could mitigate it by validating that the prompted/finetuned model generalizes to answering questions where the correct answer goes against the default sci-fi answer (on whatever other generalization we’re concerned about). We could also run this test after removing all data related or adjacent to consciousness and/or AI from the pretraining/finetuning data. These steps should limit some of the risk that the model is generalizing in a particular way just because it is role-playing a certain kind of character.
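On the second mitigation, here is a crude sketch of the kind of pretraining/finetuning data filter I mean (the keyword list is purely illustrative; a real filter would more likely use a trained classifier):

```python
AI_ADJACENT_TERMS = {
    "artificial intelligence", "neural network", "language model",
    "machine learning", "robot", "consciousness", "sentient",
}

def keep_document(text: str) -> bool:
    """Drop any document that mentions AI- or consciousness-adjacent terms."""
    lower = text.lower()
    return not any(term in lower for term in AI_ADJACENT_TERMS)

corpus = ["A recipe for sourdough bread.", "HAL 9000 is a sentient computer."]
print([doc for doc in corpus if keep_document(doc)])
# ['A recipe for sourdough bread.']
```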