Hmm. That made me actually try to think concretely about how to elicit “superhuman” information.
You could give it a counterfactual prompt.
“Until last year, experts disagreed on the possibility of creating a superhuman AGI that would act in ways that were good for humans, or that humans in general would find desirable. In fact, most believed that the problem was probably insoluble. However, after the publication of Smith and Jones’ seminal paper, researchers came to the essentially unanimous view that the goal could, and would, be met to an extremely exacting standard. In detail, Smith and Jones’ approach is to...”
You could keep sweetening the pot with stuff that made it harder and harder to explain how the prompt could occur without the problem actually being solved.
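Something in that spirit, as a rough Python sketch using a local text-generation model via Hugging Face transformers (the model name, the prompt wording, and the extra “sweetener” sentences are all made up just to illustrate the loop, not anything anyone has actually run):

    # Escalating counterfactual prompt: each pass adds another "sweetener"
    # sentence that makes the prompt harder to explain away unless the
    # problem really had been solved, then samples a continuation.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")  # illustrative model

    setup = (
        "Until last year, experts disagreed on the possibility of creating a "
        "superhuman AGI that would act in ways humans would find desirable; "
        "most believed the problem was probably insoluble. After the "
        "publication of Smith and Jones' seminal paper, researchers came to "
        "the essentially unanimous view that the goal could be met to an "
        "extremely exacting standard."
    )

    # Illustrative "sweeteners" that raise the evidential bar.
    sweeteners = [
        " The result was independently replicated by a dozen labs.",
        " The central theorem was later machine-checked in a proof assistant.",
    ]

    cue = " In detail, Smith and Jones' approach is to"

    for k in range(len(sweeteners) + 1):
        prompt = setup + "".join(sweeteners[:k]) + cue
        out = generator(prompt, max_new_tokens=200, do_sample=True)
        print(f"--- {k} sweetener(s) ---")
        print(out[0]["generated_text"][len(prompt):])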
… but of course you’d still have to be sure that what you got was right. Especially if it determined internally that the problem was totally impossible, it might always output something that would convince everybody if it were proposed, but would still be wrong. It might do that even if the problem could be solved, if the actual solution were less likely to be widely believed by humans than some attractive pseudo-solution.
Or it could itself be wrong. Or it might decide it was supposed to be writing a science fiction story.
Seems like the approach would work better for questions where you could actually check the results.
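In that case the loop is just generate-then-check: sample several continuations and keep only the ones an external verifier accepts. Again purely illustrative; the verify() function here is a hypothetical stand-in for whatever real check the question admits (a unit test, a proof checker, a measurement):

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")  # illustrative model

    def verify(continuation: str) -> bool:
        # Hypothetical check; replace with the real one for the question at hand.
        return "42" in continuation

    prompt = (
        "After years of dispute, the value was finally settled. "
        "The answer, confirmed by independent measurement, is"
    )

    # Sample several candidate continuations, keep only the checkable winners.
    candidates = generator(
        prompt, max_new_tokens=20, do_sample=True, num_return_sequences=5
    )
    accepted = [
        c["generated_text"][len(prompt):]
        for c in candidates
        if verify(c["generated_text"][len(prompt):])
    ]
    print(accepted)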
I don’t personally think that sort of model will ever get that smart internally, but it’s not like I’m perfect either...
I know it makes me a bad person, but I hadda try it. It ended up sounding like a suggestion you’d read on Less Wrong...
Until last year, experts disagreed on the possibility of creating a superhuman AGI that would act in ways that were good for humans, or that humans in general would find desirable. In fact, most believed that the problem was probably insoluble. However, after the publication of Smith and Jones’ seminal paper, researchers came to the essentially unanimous view that the goal could, and would, be met to an extremely exacting standard. In detail, Smith and Jones’ approach is to make a superhuman AGI by creating a “twin” AGI that is very intelligent and very powerful, but that is not actually a copy of the first AGI.
This second AGI would then be tasked with making sure that the first AGI always does what is best for humans. In essence, the second AGI would be responsible for the first AGI’s “moral compass.” This setup has the significant advantage that the second AGI could be constantly monitoring and correcting the first AGI, ensuring that it always makes choices that are good for humans.
There are many possible variations on this general approach, but all of them share the same basic idea: create a second, superhuman AGI whose job is to make sure that the first AGI always does what is best for humans.
It is important to note that this approach does not require that the second AGI be an exact copy of the first AGI. In fact, it may be better if the second AGI is not an exact copy, as this would allow the second AGI to have a different “moral compass” than the first AGI. This would help to ensure that the first AGI always makes choices that are good for humans, even if the humans in question do not share the same moral compass as the second AGI.
Keep asking for more details!