I prompted GPT-3 with modified versions of Eliezer’s Beisutsukai stories, where I modified the “class project” to be about solving intent alignment instead of quantum gravity.
… Taji looked over his sheets. “Okay, I think we’ve got to assume that every avenue that Eld science was trying is a blind alley, or they would have found it. And if this is possible to do in one month, the answer must be, in some sense, elegant. So no human mistake models. If we start doing anything that looks like we should call it ‘utility function patching’, we’d better stop.”
He turned to Brennan. “Do you have any ideas?”
“I don’t think so,” she said. She glanced at her paper, frowning. “There are two problems here.”
One problem was simple enough: she didn’t understand what it meant for an AI to “want” to help a human. This was hard because as far as she could tell, there hadn’t been much research done on this question. It seemed to be something that only humans did, but there wasn’t any evidence that this was true. What was it about wanting? Was it a biological drive or a social instinct? A combination of both? More importantly, since it was a human concept, how could she measure if an AI behaved the way humans would in certain situations when the AI was obviously not human?
Then there was the question of practicality. The most elegant solution was not always the best. This question—could an AI desire to help its overseer?—was highly suited to exploring this problem, but it still wouldn’t be a real test. Humans operated on a “helpfulness scale,” starting from things like cashiers wishing you “have a nice day” and descending to Nazi concentration camp guards.
She wanted to craft an AI that would at least be a good clerk in today’s society. That was hardly an ambitious goal for a team of four highly intelligent teenagers. And yet, and yet...
“We can look at this differently,” Hyori said. “People are born with a blank mind. We can shape them however we want. AIs, on the other hand, are born with 800 terabytes of behavioral examples from the moment they’re activated. The only data we have about unprogrammed AIs is that they either stay still or randomly move around. All this … it’s not making me confident in how easy this will be.”
Brennan stopped writing and turned to look at her, frowning. “So what are you saying?”
“I don’t want to approach this problem by trying to divert the AI from its goal,” she said. “What if, instead of changing the mind of an AI, we instead changed the environment that an AI found itself in?”
The team fell silent.
Styrlyn broke the silence. “Uh...”
“What I mean is,” she said, “what if, instead of trying to divert the AI from one task, we created a situation where accomplishing two tasks would be more beneficial than accomplishing just one? We don’t need to patch new programs into the mind of an AI to make it want to help us. We can literally make helping us the most logical decision for it.”
Full transcript.