I expect the results of my main research project (reverse-engineering human social instincts) to be publishable:
I don’t expect that publishing would increase socially-capable-AI. High-functioning sociopaths, who (to oversimplify somewhat) lack normal social instincts, are nevertheless very socially capable—in some ways they can be more socially capable than neurotypical people. If you think about it, an intelligent agent can form a good model of a car engine, and then use that model to skillfully manipulate the engine; well, by the same token, an intelligent agent can form a good model of a human, and then use that model to skillfully manipulate the human. You don’t need social instincts for that. Social instincts are mainly about motivations, not capabilities.
I expect that publishing would net decrease s-risks, not increase them. However, this is a long story that involves various hard-to-quantify considerations in both directions, and I think reasonable people can disagree about how they balance out. I have written down some sketchy notes trying to work through all the considerations; email me if you’re interested.
You didn’t bring this up, but I think there’s a small but nonzero chance that the story of social instincts will wind up involving aspects that I don’t want to publish because of concerns about speeding timelines-to-AGI, in which case I would probably endeavor to publish as much of the story as I could without saying anything problematic.
> I expect that publishing would net decrease s-risks, not increase them. However
Yeah, I’d be interested in this, and will email you. That said, I’ll just lay out my concerns here for posterity. What generated my question in the first place was thinking “what could possibly go wrong with publishing a reward function for social instincts?” My brain helpfully suggested that someone would use it to cognitively-shape their AI in a half-assed manner because they thought that the reward function is all they would need. Next thing you know, we’re all living in super-hell[1].
> You didn’t bring this up, but I think there’s a small but nonzero chance that the story of social instincts will wind up involving aspects that I don’t want to publish because of concerns about speeding timelines-to-AGI
You mind giving some hypothetical examples? This sounds plausible, but I’m struggling to think of concrete examples beyond vague thoughts like “maybe explaining social instincts involves describing a mechanism for sample-efficient learning”.
If we think of brain within-lifetime learning as roughly a model-based RL algorithm, then
- questions like “how exactly does this model-based RL algorithm work? what’s the model? how is it updated? what’s the neural architecture? how does the value function work? etc.” are all highly capabilities-relevant, and
- the question “what is the reward function?” is mostly not capabilities-relevant.
There are exceptions—e.g. curiosity is part of the reward function but probably helpful for capabilities—but I don’t think social instincts are one of those exceptions. Whether social instincts are in or out of the reward function, I think you get a powerful AGI either way—note that high-functioning sociopaths are generally intelligent and competent. More thorough discussion of this topic here.
So that’s basically why I’m optimistic that social instincts won’t be capabilities-relevant.
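To make that division of labor concrete, here is a minimal toy sketch (purely illustrative; the names, types, and one-step planner are my simplifying assumptions, not a claim about how the brain actually implements model-based RL). The point is just that the reward function is a swappable component, while the capabilities live in the world model, value function, and planning machinery.

```python
# Toy sketch only: illustrates "reward function = motivations, everything else =
# capabilities" under the simplifying assumptions stated above.
from dataclasses import dataclass
from typing import Callable

State = dict                          # stand-in for whatever the learned world-model state is
RewardFn = Callable[[State], float]   # e.g. with or without a "social instincts" term

@dataclass
class ModelBasedAgent:
    world_model: Callable[[State, str], State]  # predicts the next state; capabilities-relevant
    value_fn: Callable[[State], float]          # learned long-run value estimate; capabilities-relevant
    reward_fn: RewardFn                         # motivations live here

    def choose_action(self, state: State, actions: list[str]) -> str:
        # One-step lookahead planner: how *well* the agent pursues anything depends
        # on world_model and value_fn; swapping reward_fn only changes *what* it pursues.
        def score(action: str) -> float:
            predicted = self.world_model(state, action)
            return self.reward_fn(predicted) + self.value_fn(predicted)
        return max(actions, key=score)
```

In this sketch, the same world_model and value_fn can be paired with two different reward_fn’s (one with a social-instincts-like term, one without) and you get equally capable planning either way; that’s the sense in which the reward function is mostly not capabilities-relevant.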
However, social instincts are probably not as simple as “a term in a reward function”; they’re probably somewhat more complicated than that, and it’s at least possible that there are aspects of how social instincts work that cannot be properly explained except in the context of a nuts-and-bolts understanding of the gory details of the model-based RL algorithm. I still think that’s unlikely, but it’s possible.
> “what could possibly go wrong with publishing a reward function for social instincts?” My brain helpfully suggested that someone would use it to cognitively-shape their AI in a half-assed manner because they thought that the reward function is all they would need. Next thing you know, we’re all living in super-hell
A big question is: If I don’t reverse-engineer human social instincts, and nobody else does either, then what AGI motivations should we expect? Something totally random like a paperclip maximizer? Well, lots of reasonable people expect that, but I mostly don’t; I think there are pretty obvious things that future programmers can and will do that will get them into the realm of “the AGI’s motivations have some vague distorted relationship to humans and human values”, rather than “the AGI’s motivations are totally random” (e.g. see here). And if the AGI’s motivations are going to be at least vaguely related to humans and human values whether we like it or not, then from an s-risk perspective I think I’d by and large rather empower future programmers with tools that give them more control and understanding.
[1] Yes, that is an exaggeration, but I like the sentence.