Daniel Kokotajlo comments on Daniel Kokotajlo’s Shortform

Daniel Kokotajlo 17 Dec 2021 20:08 UTC
12 points
When I saw this cool new OpenAI paper, I thought of Yudkowsky’s Law of Earlier/Undignified Failure:
WebGPT: Improving the factual accuracy of language models through web browsing (openai.com)
Relevant quote:
In addition to these deployment risks, our approach introduces new risks at train time by giving the model access to the web. Our browsing environment does not allow full web access, but allows the model to send queries to the Microsoft Bing Web Search API and follow links that already exist on the web, which can have side-effects. From our experience with GPT-3, the model does not appear to be anywhere near capable enough to dangerously exploit these side-effects. However, these risks increase with model capability, and we are working on establishing internal safeguards against them.
To be clear, I am not criticizing OpenAI here; if they hadn’t done this, other people would have anyway. I’m just saying: it does seem like we are heading towards a world like the one depicted in What 2026 Looks Like, where by the time AIs develop the capability to strategically steer the future in ways unaligned with human values… they are already roaming freely around the internet, learning constantly, and conversing with millions of human allies/followers. The relevant decision won’t be “Do we let the AI out of the box?” but rather “Do we petition the government and tech companies to shut down an entire category of very popular and profitable apps, and do it immediately?”
- gwern 17 Dec 2021 21:59 UTC
  4 points
  Parent
  “Tool AIs want to be agent AIs.”