If we think of brain within-lifetime learning as roughly a model-based RL algorithm, then:
- questions like “how exactly does this model-based RL algorithm work? what’s the model? how is it updated? what’s the neural architecture? how does the value function work? etc.” are all highly capabilities-relevant, and
- the question “what is the reward function?” is mostly not capabilities-relevant.
There are exceptions—e.g. curiosity is part of the reward function but probably helpful for capabilities—but I don’t think social instincts are one of those exceptions. Whether social instincts are in or out of the reward function, I think you get a powerful AGI either way—note that high-functioning sociopaths are generally intelligent and competent. More thorough discussion of this topic here.
So that’s basically why I’m optimistic that social instincts won’t be capabilities-relevant.
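To make the structural point concrete, here’s a minimal toy sketch in Python. It’s my own illustration, not anything drawn from neuroscience or from a real AGI design: names like `toy_world_model`, `curiosity_term`, and `social_instincts_term` are made-up placeholders. The point is just that the reward function enters as a swappable sum of terms, while the machinery that actually produces competence (the learned model, the value estimates, the planner) is the same code either way.

```python
# Toy sketch (my own illustration, not a claim about brains or any actual AGI
# design): in a model-based RL agent, the reward function is a pluggable
# component, while the "capabilities" machinery (world model, value estimation
# via rollouts, planning) doesn't care which reward terms are plugged in.
import random

ACTIONS = ["left", "right"]


def toy_world_model(state, action):
    # Stand-in for a *learned* transition model; here just toy dynamics.
    return state + (1 if action == "right" else -1)


def curiosity_term(state, action, next_state):
    # Toy stand-in for curiosity: small bonus for reaching a different state.
    return 0.1 if next_state != state else 0.0


def social_instincts_term(state, action, next_state):
    # Hypothetical placeholder: the open question is what this function
    # would actually look like in humans.
    return 0.0


def make_reward_fn(weighted_terms):
    # The reward function is just a weighted sum of terms.
    def reward_fn(s, a, s_next):
        return sum(w * term(s, a, s_next) for term, w in weighted_terms)
    return reward_fn


class ModelBasedAgent:
    """The capabilities-relevant part: model + value estimates + planner."""

    def __init__(self, world_model, reward_fn, horizon=3):
        self.world_model = world_model
        self.reward_fn = reward_fn  # swappable without touching anything below
        self.horizon = horizon

    def rollout_value(self, state, action):
        # Crude value estimate: simulate the learned model forward, sum rewards.
        total = 0.0
        for _ in range(self.horizon):
            next_state = self.world_model(state, action)
            total += self.reward_fn(state, action, next_state)
            state, action = next_state, random.choice(ACTIONS)
        return total

    def act(self, state):
        # Plan by picking the action with the best simulated return.
        return max(ACTIONS, key=lambda a: self.rollout_value(state, a))


# Same agent code, two different reward functions: social-instincts term
# out vs. in. The model, value estimation, and planner are untouched.
agent_a = ModelBasedAgent(toy_world_model, make_reward_fn([(curiosity_term, 1.0)]))
agent_b = ModelBasedAgent(toy_world_model,
                          make_reward_fn([(curiosity_term, 1.0),
                                          (social_instincts_term, 1.0)]))
print(agent_a.act(0), agent_b.act(0))
```

Obviously the real algorithms are vastly more complicated than this toy, but the division of labor it illustrates is the thing I’m pointing at: you can change what the agent wants without changing how well it can pursue it.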
However, social instincts are probably not as simple as “a term in a reward function”; they’re probably somewhat more complicated than that, and it’s at least possible that there are aspects of how social instincts work that cannot be properly explained except in the context of a nuts-and-bolts understanding of the gory details of the model-based RL algorithm. I still think that’s unlikely, but it’s possible.
As for “what could possibly go wrong with publishing a reward function for social instincts?”: my brain helpfully suggested that someone would use it to cognitively shape their AI in a half-assed manner because they thought the reward function was all they would need. Next thing you know, we’re all living in super-hell.
A big question is: If I don’t reverse-engineer human social instincts, and nobody else does either, then what AGI motivations should we expect? Something totally random like a paperclip maximizer? Well, lots of reasonable people expect that, but I mostly don’t; I think there are pretty obvious things that future programmers can and will do that will get them into the realm of “the AGI’s motivations have some vague, distorted relationship to humans and human values”, rather than “the AGI’s motivations are totally random” (e.g. see here). And if the AGI’s motivations are going to be at least vaguely related to humans and human values whether we like it or not, then from an s-risk perspective I think I’d by and large rather empower future programmers with tools that give them more control and understanding.