Given the following conditions, is it possible to approximate the coherent extrapolated volition (CEV) of humanity to a “good enough” level?
Some form of reward/cost function estimation is used, such as inverse reinforcement learning or inverse optimal control. The details of the specific IRL/IOC algorithm in question are not important, just the fact that the reward/cost function is estimated. For the unfamiliar, IRL is essentially the opposite of traditional reinforcement learning in that, given a set of observations and actions, it tries to determine the reward, utility, or value function (all interchangeable) of the agent(s) that generated the set.
An agent is able to observe another (presumably human) agent’s behaviour and update its estimate of the reward function based on that, without having direct sensory input from them. This is roughly the role attributed to mirror neurons. Technologically speaking, this is probably the most difficult part to achieve, but it is not too important for the purpose of this question.
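To make the reward-estimation condition concrete, here is a minimal sketch of the idea in the simplest setting I can think of: a single state with three actions. It assumes the observed agent is Boltzmann-rational (it picks action a with probability proportional to exp(reward[a])), which is my assumption for illustration and not a property of any particular IRL/IOC algorithm. The action names and reward values are made up.

```python
import math
import random
from collections import Counter

def sample_actions(true_reward, n, rng):
    """Simulate an agent that picks action a with probability proportional to exp(reward[a])."""
    actions = list(true_reward)
    weights = [math.exp(true_reward[a]) for a in actions]
    return rng.choices(actions, weights=weights, k=n)

def estimate_reward(observed_actions):
    """Maximum-likelihood reward under the same model.

    Rewards are only identifiable up to an additive constant, so the
    estimate is shifted to give the most frequent action a reward of 0.
    """
    counts = Counter(observed_actions)
    total = sum(counts.values())
    log_freq = {a: math.log(c / total) for a, c in counts.items()}
    best = max(log_freq.values())
    return {a: v - best for a, v in log_freq.items()}

rng = random.Random(0)
# Hypothetical "true" reward, hidden from the estimator.
true_reward = {"rest": 0.0, "forage": 1.0, "explore": -1.0}
demos = sample_actions(true_reward, 10_000, rng)
estimate = estimate_reward(demos)
print(estimate)
```

The estimator only sees `demos`, never `true_reward`, and it recovers the reward differences between actions rather than their absolute values. Real IRL has to do the same thing across many states and long trajectories, which is where the difficulty lies.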
Here is my reasoning to believe that this approximation will in fact work:
First, we assume that all these constraints are true.
The estimated reward function is continuously updated with data from every individual it meets, using some form of weighted experience replay system so as to not overwrite previously-learned information.
Given that IRL/IOC can already estimate the reward function of one agent, or even a specific class of agents such as streaked shearwater birds¹, with a sufficiently complex system this algorithm should be able to extend to complex (read: human) agents.
As the number of observations n approaches infinity (or some sufficiently large number), the estimated reward function should approach one that is a “good enough” approximation of the coherent extrapolated volition of humanity.
Note that there does not need to exist some actual reward function that is natively used by real humans, evaluated by their brain. As long as human behaviour can be sufficiently approximated by a neural network, this will hold; given the wide abilities of neural networks, from classifiers to learning agents to machine translation, I don’t see this as too much of a stretch.
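As a rough sketch of the “weighted experience replay” step in the reasoning above: one possible weighting (my assumption, not the only option) is to sample individuals uniformly and then sample one of that individual’s observations, so that a heavily observed person cannot crowd out people the agent met only briefly. The class name and the toy observations are hypothetical.

```python
import random
from collections import defaultdict

class PerPersonReplayBuffer:
    """Stores observations per individual and samples individuals uniformly."""

    def __init__(self):
        self._by_person = defaultdict(list)

    def add(self, person_id, observation):
        self._by_person[person_id].append(observation)

    def sample(self, batch_size, rng):
        people = list(self._by_person)
        batch = []
        for _ in range(batch_size):
            person = rng.choice(people)  # every individual is equally likely
            batch.append((person, rng.choice(self._by_person[person])))
        return batch

rng = random.Random(0)
buffer = PerPersonReplayBuffer()
for step in range(1_000):
    buffer.add("alice", ("state", step, "action"))  # many observations
for step in range(10):
    buffer.add("bob", ("state", step, "action"))    # few observations

batch = buffer.sample(2_000, rng)
bob_share = sum(1 for person, _ in batch if person == "bob") / len(batch)
print(f"share of replayed samples from bob: {bob_share:.2f}")
```

Each update to the reward estimate would then be computed on such a batch rather than only on the newest observations. Whether every person should get equal weight is itself a value-laden choice that this sketch does not settle.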
However, I do anticipate certain objections to this explanation. Let me run through a few of them.
Humans are too complex to have their behaviours estimated by inverse reinforcement learning.
This seems to me like an argument from human exceptionalism, or anthropocentrism, and I don’t see any reason for it to be true. Various animals already demonstrate many behaviours that anthropocentrists consider unique to humans: tool use in various primates and birds, face recognition in crows, and speech mimicry and basic arithmetic in parrots. Given these examples, I don’t see a compelling anthropocentric objection to this approach.
Getting the input and output necessary to perform online (i.e. real-time) inverse reinforcement learning is infeasible.
This is one of the most compelling counterarguments to this approach. However, I think that even if recreating “mirror neurons” (i.e. neurons that fire both when the agent does something and when it observes someone else doing it) is too difficult, another approach could be used. A sufficiently realistic VRMMORPG-like environment (VRMMORPG: Virtual Reality Massively Multiplayer Online Role-Playing Game) could be used to collect input sensory data and behaviours from players. If players are properly incentivised to act as they would in a real environment, then with a sufficient amount of pre-training, a “close-enough” approximation of the CEV should be possible.
“Close-enough” doesn’t even mean anything!
This is also an issue, yes. There are a number of ways to define “close-enough”, but I choose to leave the choice of which up to you. Some examples are: “functionally indistinguishable”, “functionally indistinguishable within a society”, “functionally indistinguishable within an intra-societal community”, or “functionally indistinguishable within a small group”. These aren’t exhaustive, and I can see any number of ways to define “close-enough”.
What do you mean by approximating the CEV? Isn’t it by definition incomprehensible to ourselves when extrapolated so far out? Doesn’t that mean it would be impossible to approximate it from individual observations?
This is where it gets dicey. Since we don’t know the CEV, how do we know if we have successfully approximated it? Can it even be approximated? One of the issues I thought of while writing this is that individual human behaviour may not converge to the CEV. My expectation is that as the number of humans sampled, as well as the number of samples taken from each individual, grows, the estimate approaches the volition an individual would have if they had the same resources as the entirety of the observed population. My assumption is that this is equivalent to the CEV, which may not be true.
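To separate the two things being assumed in this response, here is one way to write them down. The notation is introduced here purely for illustration: N is the number of people observed, m is the number of samples per person, and the subscripted reward functions are labels for the quantities described above.

```latex
% Assumption 1 (convergence): with enough people and enough samples per
% person, the estimate settles on some pooled reward function.
\lim_{N,\, m \to \infty} \hat{R}_{N,m} = R_{\text{pooled}}

% Assumption 2 (equivalence): the pooled function coincides with the CEV.
% This is the step that may not hold.
R_{\text{pooled}} \overset{?}{=} R_{\text{CEV}}
```

Even if the first assumption holds, the second does not follow from it, and nothing observable directly tests it.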
However, I’d be interested to see if there are any rebuttals to my responses to these counterarguments, as well as any counterarguments that I didn’t bring up, of which there are definitely many. Also, if I made any mistakes or if anything in this post isn’t clear, feel free to ask and I’ll clarify it.
Footnotes
¹ Hirakawa, Tsubasa, Takayoshi Yamashita, Toru Tamaki, Hironobu Fujiyoshi, Yuta Umezu, Ichiro Takeuchi, Sakiko Matsumoto, and Ken Yoda. “Can AI Predict Animal Movements? Filling Gaps in Animal Trajectories Using Inverse Reinforcement Learning.” Ecosphere 9, no. 10 (2018): N/a.
Did you read Rohin Shah’s value learning sequence? It covers this whole area in a good amount of detail, and I think answers your question pretty straightforwardly:
Here is a link to the full sequence: https://www.lesswrong.com/s/4dHMdK5TLN6xcqtyc
Fwiw the quoted section was written by Paul Christiano, and I have used that blog post in my sequence (with permission).
Also, for this particular question you can read just Chapter 1 of the sequence.
Ah, yes. Sorry. I should have made the authorship of that quote clearer.
Thank you for your feedback! I haven’t read this yet, but it comes pretty close to a discussion I had with a friend over this post.
Essentially, her argument started with a simple counterexample: she bought peanut M&Ms when she didn’t want to, and didn’t realise she was doing it until afterwards. In a similar earlier situation, hungry and in the same place, she had wanted peanut M&Ms to satisfy her hunger; this time she didn’t want them. She knew she didn’t want peanut M&Ms, and didn’t consciously decide to get them against that want. In this sense, I think a parallel can be drawn with akrasia, where rationality alone isn’t enough.
Her point was this: there has to be a line drawn between “intentional conscious action” and “the result of a complex system of interacting parts that puppets the meat sack that holds our brain, sometimes in ways we don’t intend.” On a base level, this could result in, say, an AI that acts like a normal human but sometimes buys peanut M&Ms against its own volition. On an agent-based level where an AI is no more or less capable than a human, this isn’t much of an issue, and such quirks could even make individual AI agents more convincing.
But if you want to make a superintelligent AI to run your ideal utopia, you don’t want it to decide to feed everyone peanut M&Ms against their will on a whim.
The biggest issue is that we can’t determine the difference between “intentional action” and “unintentional response”. If we could, it would (according to her) be trivial to find out what the CEV of humanity is, no estimation needed.
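A toy sketch of why this matters for reward estimation, under assumptions I am making up for illustration: a hypothetical shopper who prefers fruit whenever she chooses deliberately, but who grabs peanut M&Ms on autopilot 70% of the time. An estimator that treats every purchase as a deliberate choice (here, a simple log-frequency estimate) has no way to tell the two apart.

```python
import math
import random
from collections import Counter

def simulate_purchases(n, habit_prob, rng):
    """Hypothetical shopper: fruit when deliberate, M&Ms when on autopilot."""
    purchases = []
    for _ in range(n):
        if rng.random() < habit_prob:
            purchases.append("peanut_mms")  # unintentional response
        else:
            purchases.append("fruit")       # intentional action
    return purchases

def naive_reward_estimate(purchases):
    """Treat every purchase as deliberate and read rewards off the
    log-frequencies (identifiable only up to an additive constant)."""
    counts = Counter(purchases)
    total = len(purchases)
    return {item: math.log(count / total) for item, count in counts.items()}

rng = random.Random(0)
purchases = simulate_purchases(1_000, habit_prob=0.7, rng=rng)
estimate = naive_reward_estimate(purchases)
inferred_favourite = max(estimate, key=estimate.get)
print(inferred_favourite)
```

The estimator concludes that she values peanut M&Ms most, even though by construction every deliberate choice went to fruit. More data does not help, because the error is in the model of the human and not in the sample size.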
My largest assumption was that the lowest common denominator of human behaviour is “principled reasoning in pursuit of fixed, though unstated, goals”. More realistically, as another friend (and the post you linked) pointed out, the lowest common denominator of human behaviour is going to be “reproduce”, which has very unfortunate implications for the Friendliness of this hypothetical agent.
A number of things could be done to ameliorate this, such as not including any means to reproduce or any data supporting reproduction in the trajectories, but they all seem inadequate or ad hoc. I don’t want to staple together a bunch of things I barely understand and declare it the Solution To AI (not that I was attempting to do that, anyway), especially when the issue isn’t necessarily with the technology and theory. As the peanut-M&M-purchasing friend put it, the technology is sufficient but this post overestimates humans. This wasn’t actually where I expected to run into trouble, and it shifts the problem from “improve technology and theories” to… what, “improve humans”? I’m at a loss as to where to go from here; inverse reinforcement learning has a demonstrable use case and benefits, but the data is… not good. Garbage in, garbage out. Is it really possible to improve human behaviour (or our analysis and collection of human behaviour) to achieve better results?
There’s a lot of speculation about related-ish topics in Chapter 3 of the sequence linked above.