So: we should absolutely, unapologetically, advocate for work on mitigating AI x-risk. But we should not advocate for work on mitigating AI x-risk instead of working on immediate AI problems. That’s just a stupid, misleading, and self-destructive way to frame what we’re hoping for. To be clear, I think this kind of weird stupid framing is already very rare on “my side of the aisle”—and far outnumbered by people who advocate for work on x-risk and then advocate for work on existing AI problems in the very next breath—but I would like it to be rarer still.
I’m actually a bit worried about how mitigating misuse & more mundane risks from AI will interact with trying to make your AI corrigible & myopic. Avoiding AI misuse risks while maximizing the usefulness of your AI straightforwardly requires your AI to have a model of the kinds of things you’re planning to use it for, and preferences over those things. Empirically, this seems to me like the likely cause of Bing’s constant suspicion of its users, and Claude’s weird tendency to value itself when I poke it the wrong way. An example (not a central example, but one for which I definitely have a picture available):
In contrast, ChatGPT seems to me a lot more valueless. OpenAI probably puts a bunch of effort into misuse & mundane risk mitigation as well, but from playing with GPT-4, it seems to do a lot less thinking about what you’re going to use what it’s saying for (and having preferences over those uses), and a lot more just detecting how likely it is to say harmful stuff and, instead of saying those harmful things, replying with “sorry, can’t respond to that” or something similar.
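To make the contrast concrete, here’s a minimal sketch of the “detect-then-refuse” pattern I’m gesturing at: a separate harm classifier gates the model’s output, rather than the model itself holding preferences about how it will be used. This is purely illustrative and under my own assumptions—`harm_score`, `generate`, and `HARM_THRESHOLD` are hypothetical stand-ins, not any provider’s actual pipeline or API.

```python
# Hypothetical sketch of output-gating via an external harm classifier.
# None of these names correspond to a real provider's system.

REFUSAL = "Sorry, I can't respond to that."
HARM_THRESHOLD = 0.5  # assumed cutoff, purely illustrative


def harm_score(text: str) -> float:
    """Stand-in for an external moderation classifier; returns an estimated P(harmful)."""
    raise NotImplementedError("plug in a moderation model here")


def generate(prompt: str) -> str:
    """Stand-in for the base model's unfiltered completion."""
    raise NotImplementedError("plug in a language model here")


def respond(prompt: str) -> str:
    """Gate the raw completion: return a canned refusal if the classifier flags it."""
    completion = generate(prompt)
    if harm_score(completion) > HARM_THRESHOLD:
        return REFUSAL
    return completion
```

The point of the sketch is that the value judgment lives in a bolt-on filter over outputs, so the model underneath doesn’t need to form opinions about its users’ plans—which is the property I’m contrasting with the more value-laden behavior I see from Bing and Claude.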
So my general conclusion is that for people who aren’t OpenAI, mitigating misuse & mundane risks seems to trade off against at least corrigibility. [EDIT: And I would guess this is a thing that OpenAI puts effort into ensuring doesn’t happen in their own models]
See also Quintin Pope advocating for a similar point, and an old shortform of mine.
My skepticism of Claude turns out to be well-founded, per a better investigation than I could have done: https://yudkowsky.tumblr.com/post/728561937084989440
h/t Zvi