I like this post, and basically agree, but it comes across somewhat more broad and confident than I am, at least in certain places.

I’m currently thinking about RL along the lines of Nostalgebraist here:

“Reinforcement learning” (RL) is not a technique. It’s a problem statement, i.e. a way of framing a task as an optimization problem, so you can hand it over to a mechanical optimizer. What’s more, even calling it a problem statement is misleading, because it’s (almost) the most general problem statement possible for any arbitrary task. —Nostalgebraist 2020
If that’s right, then I am very reluctant to say anything whatsoever about “RL agents in general”. They’re too diverse.
Much of the post, especially the early part, reads (to me) like confident claims about all possible RL agents. For example, the excerpt “…reward is the antecedent-computation-reinforcer. Reward reinforces those computations which produced it.” sounds like a confident claim about all RL agents, maybe even by definition of “RL”. (If so, I think I disagree.)
But other parts of the post aren’t like that—for example, the “Does the choice of RL algorithm matter?” part seems more reasonable and hedged, and likewise there’s a mention of “real-world general RL agents” somewhere which maybe implies that the post is really only about that particular subset of RL agents, as opposed to all RL agents. (Right?)
For what it’s worth, I think “reward is the antecedent-computation-reinforcer” will probably be true in RL algorithms that scale to AGI, because it seems like generally the best and only type of technique that can solve the technical problem that it solves. But that’s a tricky thing to be super-duper-confident about, especially in the big space of all possible RL algorithms.
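To make that concrete, here is a minimal self-contained sketch of the mechanism the phrase points at, using the textbook REINFORCE update on a two-armed bandit (this is an illustration of standard policy-gradient credit assignment, not anything specific to the post): reward scales the gradient step on whatever computation produced the action, so the antecedent computation gets reinforced.

```python
import math
import random

random.seed(0)

# Two-armed bandit: arm 1 pays reward 1, arm 0 pays 0.
# The "policy" is a single logit; REINFORCE nudges whatever
# computation (here, just the logit) produced a rewarded action.
logit = 0.0
lr = 0.5

def p_arm1(logit):
    """Probability of pulling arm 1 under the current policy."""
    return 1.0 / (1.0 + math.exp(-logit))

for _ in range(200):
    p = p_arm1(logit)
    action = 1 if random.random() < p else 0
    reward = 1.0 if action == 1 else 0.0
    # REINFORCE: grad of log pi(action) w.r.t. the logit is (action - p).
    # The update is reward-weighted, so only rewarded actions
    # reinforce the computation that produced them.
    logit += lr * reward * (action - p)

# The policy now strongly prefers the arm that produced reward.
print(round(p_arm1(logit), 3))
```

The point of the toy: reward never enters as a goal the agent represents; it only shows up as a multiplier on the update to the computation that preceded it, which is exactly the "antecedent-computation-reinforcer" reading.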
Another example spot where I want to make a weaker statement than you: where you say “Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal”. I would instead say “Deep reinforcement learning agents will not NECESSARILY come to intrinsically and primarily value their reward signal”. Do you have an argument that categorically rules out this possibility? I don’t see it.
FWIW I upvoted but disagree with the end part (hurray for more nuance in voting!)
I think “reward is the antecedent-computation-reinforcer” will probably be true in RL algorithms that scale to AGI
At least from my epistemic position, there looks to be an explanation/communication gap here: I don’t think we can be as confident of this. To me this claim seems to preclude ‘creative’ forward-looking exploratory behaviour and model-based planning, which have more of a probingness and less of a merely-antecedent-computation-reinforcingness. But I see other comments from you here which talk about foresighted exploration (and foresighted non-exploration!), and I know you’ve written about these things at length. How are you squaring/nuancing these things? (Silence or a link to an already-written post will not be deemed rude.)
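To sharpen the contrast being drawn: in model-based planning, reward (or a learned reward model) can be consulted prospectively inside the lookahead, rather than only acting backwards as a reinforcer. A toy illustration of that forward-looking role, with all names and the toy world purely hypothetical:

```python
# Toy contrast with the backward-reinforcement picture: here reward
# enters as a forward-looking evaluator inside a one-step lookahead,
# and nothing gets reinforced. Everything below is illustrative.

def world_model(state, action):
    """Deterministic toy dynamics: state is an int, actions shift it."""
    return state + action

def reward_model(state):
    """Reward the planner consults *before* acting: prefers states near 10."""
    return -abs(state - 10)

def plan(state, actions):
    # Pick the action whose *predicted* outcome scores best under the
    # reward model: reward is used prospectively, not as a reinforcer.
    return max(actions, key=lambda a: reward_model(world_model(state, a)))

print(plan(0, [-1, 0, 1]))   # → 1 (moves toward 10)
print(plan(15, [-1, 0, 1]))  # → -1 (moves back toward 10)
```

Whether agents like this still fit the "antecedent-computation-reinforcer" framing (e.g. because the reward model itself was trained by such updates) is exactly the question being asked above.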