Next, we might imagine GPT-N to just be an Oracle AI, which we would have better hopes of using well. But I don’t expect that an approximate Oracle AI could be used safely with anything like the precautions that might work for a genuine Oracle AI. I don’t know what internal optimizers GPT-N ends up building along the way, but I’m not going to count on there being none of them.
Is the distinguishing feature between Oracle AI and approximate Oracle AI, as you use the terms here, just about whether there are inner optimizers or not?
(When I started the paragraph I assumed “approximate Oracle AI” just meant an Oracle AI whose predictions aren’t very reliable. Given how the paragraph ends, though, I conclude that whether there are inner optimizers is an important part of the distinction you’re drawing. But I’m just not sure whether it’s the whole of the distinction.)
The outer optimizer is the more obvious thing: it’s straightforward to say there’s a big difference between dealing with a superhuman Oracle AI whose only goal is to answer each question accurately, and dealing with one whose goals differ from that in some slight way. Inner optimizers illustrate another failure mode.
GPT generates text by repeatedly picking whichever next word it judges most probable given all the words that came before. So if its notion of “most probable” is almost, but not quite, answering every question accurately, I would expect a system which usually answers questions accurately but sometimes answers them inaccurately. That doesn’t sound very scary?
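For concreteness, here is a minimal sketch of the decoding loop being described, using the publicly released GPT-2 through the Hugging Face transformers library. The prompt, generation length, and the greedy pick-the-top-word rule are illustrative assumptions, not details from the discussion:

```python
# Minimal greedy decoding loop: at each step, score every candidate next token
# and append the single most probable one. (Real systems often sample instead.)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical question-answering prompt, purely for illustration.
prompt = "Q: What is the capital of France?\nA:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                              # generate up to 20 tokens
        logits = model(input_ids).logits             # scores over the vocabulary
        next_id = logits[0, -1].argmax()             # greedily take the top token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

Sampling from the predicted distribution, rather than always taking the argmax, is the more common deployment choice, but the loop structure is the same.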
Got it. Thanks!