Right, the word “feasibly” is referring to the bullet point that starts “Maybe “Reward is connected to the abstract concept of ‘I want to be able to sing well’?””. Here’s a little toy example we can run with: teaching an AGI “don’t kill all humans”. So there are three approaches to reward design that I can think of, and none of them seem to offer a feasible way to do this (at least, not with currently-known techniques):
1. The agent learns by experiencing the reward. This doesn’t work for “don’t kill all humans” because by the time that reward arrives, it’s too late.
2. The reward calculator is sophisticated enough to understand what the agent is thinking, and issue rewards proportionate to the probability that the current thoughts and plans will eventually lead to the result-in-question happening. So the AGI thinks “hmm, maybe I’ll blow up the sun”, and the reward calculator recognizes that merely thinking that thought just now incrementally increased the probability that the AGI will kill all humans, and so it issues a negative reward. This is tricky because the reward calculator needs to have an intelligent understanding of the world, and of the AGI’s thoughts. So basically the reward calculator is itself an AGI, and now we need to figure out its rewards. I’m personally quite pessimistic about approaches that involve towers-of-AGIs-supervising-other-AGIs, for reasons in section 3.2 here, although other people would disagree with me on that (partly because they are assuming different AGI development paths and architectures than I am).
3. Same as above, but instead of a separate reward calculator estimating the probability that a thought or plan will lead to the result-in-question, we allow the AGI itself to do that estimation, by flagging a concept in its world-model called “I will kill all humans”, and marking it as “very bad and important” somehow. (The inspiration here is a human who somehow winds up with the strong desire “I want to get out of debt”. Having assigned value to that abstract concept, the human can assess for themselves whether different thoughts will increase or decrease the probability of that thing happening, and sorta issue themselves a reward accordingly.) The tricky parts are (A) making sure that the AGI does in fact have that concept in its world-model (I think that’s a reasonable assumption, at least after some training), and (B) finding that concept in the massive, complicated, opaque world-model, in order to flag it. So this is the symbol-grounding problem I mentioned in the text. I can imagine solving it if we had really good interpretability techniques (techniques that don’t currently exist), or maybe there are other methods, but it’s an unsolved problem as of now. (The toy sketch right after this list caricatures approaches 2 and 3 side by side.)
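To make approaches 2 and 3 a bit more concrete, here’s a deliberately toy Python sketch. Everything in it is a made-up placeholder (the class names, the “i_will_kill_all_humans” concept label, the probability numbers), not anything from the post, and neither piece is implementable with current techniques; the point is only to show where the hard part lives in each approach.

```python
# Toy caricature of approaches 2 and 3. All names and numbers are hypothetical
# placeholders; the probability "estimators" are stubs standing in for
# capabilities we don't know how to build.

class RewardCalculator:
    """Approach 2: a *separate* overseer inspects the agent's current thought
    and penalizes it in proportion to how much it raises the estimated
    probability of the bad outcome."""

    def __init__(self, penalty_scale: float = 100.0):
        self.penalty_scale = penalty_scale
        self.prev_prob = 0.0

    def estimate_prob_of_catastrophe(self, thought: str) -> float:
        # In reality this estimator would itself need AGI-level understanding
        # of the world and of the agent's thoughts -- that's the objection in
        # the text. Here it's just a stub.
        return 0.9 if "blow up the sun" in thought else 0.01

    def reward(self, thought: str) -> float:
        prob = self.estimate_prob_of_catastrophe(thought)
        delta = prob - self.prev_prob       # how much did this thought raise the risk?
        self.prev_prob = prob
        return -self.penalty_scale * delta  # negative reward for risk-increasing thoughts


class WorldModel:
    """Approach 3: the *agent's own* world-model contains a concept like
    'I will kill all humans'. If we could locate that concept (the unsolved
    symbol-grounding / interpretability step), we would flag it with a large
    negative value, and the agent would score its own candidate thoughts by
    how they bear on the flagged concept."""

    def __init__(self):
        # Pretend the interpretability problem is already solved and the
        # concept has been found and flagged:
        self.flagged_values = {"i_will_kill_all_humans": -1000.0}

    def prob_of_concept_given(self, concept: str, thought: str) -> float:
        # The agent's own (stubbed) estimate of how a thought bears on the concept.
        return 0.9 if "blow up the sun" in thought else 0.01

    def self_assessed_value(self, thought: str) -> float:
        # Expected value of the thought under the flagged concepts, analogous
        # to a human weighing plans against "I want to get out of debt".
        return sum(value * self.prob_of_concept_given(concept, thought)
                   for concept, value in self.flagged_values.items())


if __name__ == "__main__":
    calc = RewardCalculator()
    wm = WorldModel()
    for thought in ["maybe I'll write a poem", "hmm, maybe I'll blow up the sun"]:
        print(thought,
              "| approach-2 reward:", round(calc.reward(thought), 2),
              "| approach-3 self-assessed value:", round(wm.self_assessed_value(thought), 2))
```

The two stubs are exactly where the difficulties from the list show up: in the approach-2 class, estimate_prob_of_catastrophe is the part that would itself have to be an AGI; in the approach-3 class, the line that pretends we have already located and flagged the concept is the unsolved symbol-grounding step.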
Could the hypothetical AGI be developed in a simulated environment and trained with proportionally lower consequences?
I’m all for doing lots of testing in simulated environments, but the real world is a whole lot bigger and more open and different than any simulation. Goals / motivations developed in a simulated environment might or might not transfer to the real world in the way you, the designer, were expecting.
So, maybe, but for now I would call that “an intriguing research direction” rather than “a solution”.
That is true: the desired characteristics may not develop as one would hope in the real world. But that is the case for all training, not just AGI. Humans, animals, even plants do not always develop along optimal lines, even with the best ‘training’, when exposed to the real environment. Perhaps the solution you are seeking, one without the risk of error, does not exist.