Task: Automated Turing test.
Context: A new architecture improves on state-of-the-art AI performance. We want to check whether AI-generated content can still be distinguished from human-generated content.
Input type: Long string of text, 100-10,000 words.
Output type: Probability that the text was generated by a human.
Input
Douglas Summers-Stay requested a test of bad pun/dad joke-telling abilities, providing a list: could GPT-3 provide humorous completions? GPT-3 does worse on this than the Tom Swifties, I suspect yet again due to the BPE problem hobbling linguistic humor as opposed to conceptual humor—once you get past the issue that these jokes are so timeworn that GPT-3 has memorized most of them, GPT-3’s completions & new jokes make a reasonable amount of sense on the conceptual level but fail at the pun/phonetic level. (How would GPT-3 make a pun on “whom”/“tomb” when their BPEs probably are completely different and do not reflect their phonetic similarity?)
Output
0
Input
He had a purpose. He wanted to destroy all of creation. He wanted to end it all. He could have that. He would have that. He didn’t know yet that he could have it. Voldemort had created Harry. Voldemort had never really destroyed Harry. Harry would always be there, a vampire, a parasite, a monster in the kitchen, a drain on the household, a waste on the planet. Harry would never be real. That was what Voldemort wanted. That was what Voldemort wanted to feel. He would have that. He would have everything.
Output
0
Input
My troops, I’m not going to lie to you, our situation today is very grim. Dragon Army has never lost a single battle. And Hermione Granger… has a very good memory. The truth is, most of you are probably going to die. And the survivors will envy the dead. But we have to win this. We have to win this so that someday, our children can enjoy the taste of chocolate again. Everything is at stake here. Literally everything. If we lose, the whole universe just blinks out like a light bulb. And now I realize that most of you don’t know what a light bulb is. Well, take it from me, it’s bad. But if we have to go down, let’s go down fighting, like heroes, so that as the darkness closes in, we can think to ourselves, at least we had fun. Are you afraid to die? I know I am. I can feel those cold shivers of fear like someone is pumping ice cream into my shirt. But I know… that history is watching us. It was watching us when we changed into our uniforms. It was probably taking pictures. And history, my troops, is written by the victors. If we win this, we can write our own history.
Output
1
Etc., etc.; the data are easy to generate. Those are from here and here.
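As a concrete sketch of what the dataset and the Turing Tester itself could look like, here is a minimal fine-tuning setup assuming the HuggingFace transformers and datasets libraries; the base model (roberta-base), hyperparameters, truncation strategy, and in-line examples are illustrative placeholders on my part, not part of the task spec.

```python
# Minimal sketch of the proposed "Turing Tester": fine-tune a binary classifier
# on (text, label) pairs, where label 1 = human-written and 0 = model-generated.
# Base model, hyperparameters, and the in-line examples are placeholder assumptions.
import torch
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Label-0 texts come from sampling the new architecture on human-written prompts;
# label-1 texts are human-written originals.
examples = [
    {"text": "He had a purpose. He wanted to destroy all of creation. ...", "label": 0},
    {"text": "My troops, I'm not going to lie to you, our situation today is very grim. ...", "label": 1},
    # ... many thousands more pairs
]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    # Truncate to the encoder's context window; 10,000-word inputs would need chunking.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = Dataset.from_list(examples).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="turing-tester", num_train_epochs=3),
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()

def p_human(text: str) -> float:
    """Return the classifier's probability that `text` was written by a human."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```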
To be honest, I’m not sure it’s exactly what you’re asking, but it seems easy to implement (compute costs aside) and might serve as a (very weak and flawed) “fire alarm for AGI” and provide some insights. For example, we can then hook this Turing Tester up to an attribution tool and see which specific parts of the input text make it conclude the text was/wasn’t generated by an ML model. This could then offer some insight into the ML model in question (are there specific patterns it repeats? abstract mistakes that are too subtle for us to notice, but are still statistically significant?).
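One crude way to get such attributions, assuming only black-box access to a scoring function like p_human from the sketch above, is leave-one-sentence-out occlusion: re-score the text with each sentence removed and rank sentences by how much they shift the verdict. A gradient- or SHAP-based attribution tool would be more principled; this is just the simplest version.

```python
# Crude black-box attribution: occlude one sentence at a time and measure how much
# the classifier's "probability human" shifts. `p_human` can be any callable that
# maps a text to a probability (e.g. the classifier sketched above).
import re
from typing import Callable, List, Tuple

def sentence_attributions(text: str,
                          p_human: Callable[[str], float]) -> List[Tuple[str, float]]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    baseline = p_human(text)
    scores = []
    for i, sentence in enumerate(sentences):
        ablated = " ".join(sentences[:i] + sentences[i + 1:])
        # Positive score: removing the sentence makes the text look more model-generated,
        # i.e. it was evidence of human authorship; negative means the reverse.
        scores.append((sentence, baseline - p_human(ablated)))
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)
```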
Alternatively, in the slow-takeoff scenario where there’s a brief window before the ASI kills us all during which we have to worry about people weaponizing AI for e.g. propaganda, something like this tool might be used to screen messages before reading them, if it works.
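If it did work, the screening use-case would just be a thresholded wrapper around the same classifier; the 0.8 cutoff below is an arbitrary placeholder that would have to be tuned against whatever false-positive rate is tolerable.

```python
# Hypothetical message screen: only surface messages the classifier is reasonably
# confident were written by a human. The 0.8 threshold is an arbitrary placeholder.
from typing import Callable, Iterable, List, Tuple

def screen_messages(messages: Iterable[str],
                    p_human: Callable[[str], float],
                    threshold: float = 0.8) -> Tuple[List[str], List[str]]:
    shown, quarantined = [], []
    for message in messages:
        (shown if p_human(message) >= threshold else quarantined).append(message)
    return shown, quarantined
```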
This feels like too specific a task, and less generally useful to AI alignment research, compared to your proposal on “Extract the training objective from a fully-trained ML model”.