Algon comments on My AGI safety research—2022 review, ’23 plans

Algon 14 Dec 2022 19:34 UTC
LW: 3 AF: 3
> I expect that publishing would net decrease s-risks, not increase them. However
Yeah, I’d be interested in this, and will email you. That said, I’ll just lay out my concerns here for posterity. What generated my question in the first place was thinking “what could possibly go wrong with publishing a reward function for social instincts?” My brain helpfully suggested that someone would use it to cognitively-shape their AI in a half-assed manner because they thought that the reward function is all they would need. Next thing you know, we’re all living in super-hell^[1].
> You didn’t bring this up, but I think there’s a small but nonzero chance that the story of social instincts will wind up involving aspects that I don’t want to publish because of concerns about speeding timelines-to-AGI
You mind giving some hypothetical examples? This sounds plausible, but I’m struggling to think of concrete examples beyond vague thoughts like “maybe explaining social instincts involves describing a mechanism for sample-efficient learning”.
1. ^
  Yes, that is an exaggeration, but I like the sentence.
- Steven Byrnes 14 Dec 2022 20:20 UTC
  LW: 3 AF: 3
  > You mind giving some hypothetical examples?
  If we think of brain within-lifetime learning as roughly a model-based RL algorithm, then
  - questions like “how exactly does this model-based RL algorithm work? what’s the model? how is it updated? what’s the neural architecture? how does the value function work? etc.” are all highly capabilities-relevant, and
  - the question “what is the reward function?” is mostly not capabilities-relevant.
  There are exceptions—e.g. curiosity is part of the reward function but probably helpful for capabilities—but I don’t think social instincts are one of those exceptions. If social instincts are in versus out of the reward function, I think you get a powerful AGI either way—note that high-functioning sociopaths are generally intelligent and competent. More thorough discussion of this topic here.
  So that’s basically why I’m optimistic that social instincts won’t be capabilities-relevant.
  However, social instincts are probably not as simple as “a term in a reward function”; they’re probably somewhat more complicated than that. It’s at least possible that some aspects of how social instincts work cannot be properly explained except in the context of a nuts-and-bolts understanding of the gory details of the model-based RL algorithm. I still think that’s unlikely, but it’s possible.
  > “what could possibly go wrong with publishing a reward function for social instincts?” My brain helpfully suggested that someone would use it to cognitively-shape their AI in a half-assed manner because they thought that the reward function is all they would need. Next thing you know, we’re all living in super-hell
  A big question is: If I don’t reverse-engineer human social instincts, and nobody else does either, then what AGI motivations should we expect? Something totally random like a paperclip maximizer? Well, lots of reasonable people expect that, but I mostly don’t; I think there are pretty obvious things that future programmers can and will do that will get them into the realm of “the AGI’s motivations have some vague distorted relationship to humans and human values”, rather than “the AGI’s motivations are totally random” (e.g. see here). And if the AGI’s motivations are going to be at least vaguely related to humans and human values whether we like it or not, then by and large I think I’d rather empower future programmers with tools that give them more control and understanding, from an s-risk perspective.
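
The split described in the reply above, between the learning-and-planning machinery (capabilities-relevant) and the reward function (mostly not), can be illustrated with a toy sketch. Everything here is hypothetical and is not the brain's algorithm or anything proposed in the thread: the five-state world, the names `task_reward`, `social_term`, `combine`, and `plan`, and the penalty value are all assumptions made for illustration. The point is only that the planner is the same code in both runs, and swapping the reward function changes what the agent does, not how competently it plans.

```python
from typing import Callable, Dict, List, Tuple

State = int
Action = int
RewardFn = Callable[[State, Action, State], float]
Model = Dict[Tuple[State, Action], State]

GOAL = 4

# Hypothetical deterministic world model: (state, action) -> next state.
# From state 0, action 0 takes a long route (0 -> 1 -> 3 -> 4),
# and action 1 takes a shortcut through state 2 (0 -> 2 -> 4).
MODEL: Model = {
    (0, 0): 1, (0, 1): 2,
    (1, 0): 3, (1, 1): 3,
    (2, 0): 4, (2, 1): 4,
    (3, 0): 4, (3, 1): 4,
    (4, 0): 4, (4, 1): 4,  # the goal is absorbing
}


def task_reward(s: State, a: Action, s_next: State) -> float:
    """Toy task term: reward 1 on first arrival at the goal."""
    return 1.0 if (s_next == GOAL and s != GOAL) else 0.0


def social_term(s: State, a: Action, s_next: State) -> float:
    """Toy stand-in for an extra reward term: entering state 2 is penalized."""
    return -0.5 if s_next == 2 else 0.0


def combine(*terms: RewardFn) -> RewardFn:
    """Sum several reward terms into one reward function."""
    def total(s: State, a: Action, s_next: State) -> float:
        return sum(term(s, a, s_next) for term in terms)
    return total


def plan(model: Model, reward_fn: RewardFn, n_states: int, n_actions: int,
         gamma: float = 0.9, sweeps: int = 50) -> List[Action]:
    """Value iteration over the model; returns a greedy action per state."""
    def q(s: State, a: Action, values: List[float]) -> float:
        s_next = model[(s, a)]
        return reward_fn(s, a, s_next) + gamma * values[s_next]

    values = [0.0] * n_states
    for _ in range(sweeps):
        values = [max(q(s, a, values) for a in range(n_actions))
                  for s in range(n_states)]
    return [max(range(n_actions), key=lambda a: q(s, a, values))
            for s in range(n_states)]


# Same model and planner, two different reward functions.
policy_task_only = plan(MODEL, task_reward, n_states=5, n_actions=2)
policy_with_social = plan(MODEL, combine(task_reward, social_term),
                          n_states=5, n_actions=2)
print(policy_task_only[0])    # 1: takes the shortcut through state 2
print(policy_with_social[0])  # 0: takes the long route, avoiding state 2
```

In this sketch both agents reach the goal, which loosely mirrors the claim that the agent is capable with or without the extra term. It does not capture the caveat raised above that real social instincts may be more entangled with the learning algorithm than a single additive term.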