There seems to be some confusion going on here. Assuming an agent is accurately modeling the consequences of changing its own value function, and is not trying to hack around some major flaw in its own algorithm, it would never do so: by definition, [correctly] optimizing a different value function cannot improve the value of your current value function.
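To spell out the step that argument leans on (notation is mine, not from the thread): if the agent correctly optimizes whatever value function it has, then judged by its current value function, adopting any alternative is weakly worse.

```latex
% Notation mine: \Pi is the policy space, V the agent's current value
% function over policies, and V' any alternative it might self-modify into.
% "Correctly optimizing" U means adopting \pi_U = \arg\max_{\pi \in \Pi} U(\pi).
\[
V(\pi_{V'}) \;\le\; \max_{\pi \in \Pi} V(\pi) \;=\; V(\pi_V),
\]
% so, scored by the value function the agent has *now*, switching to V'
% can never be a strict improvement. That is the "it would never do so" step.
```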
Clippy isn’t a maximizer. And neither is any current RL agent. I did mention that, but I’ll edit to make that clear.
That is the opposite of what you said: Clippy, according to you, is maximizing the output of its critic network. And you can’t say “there’s not an explicit mathematical function”; any neural network with a specific set of weights is by definition an explicit mathematical function, just usually not one with a compact representation.
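To illustrate that point with a toy example (all weights and sizes here are invented for the sketch): once the weights are fixed, a two-layer network just *is* a closed-form expression, an explicit function with no compact, human-readable form.

```python
import numpy as np

# Toy illustration (weights and sizes made up for this sketch):
# with fixed weights, a 2-layer network IS the closed-form function
#   f(x) = W2 @ relu(W1 @ x + b1) + b2,
# an explicit mathematical function, just not a readable one.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

print(f(np.ones(4)))  # deterministic: same input, same output, every time
```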
What I was trying to say is that RL agents DO maximize the output of their critic networks, but the critic network does not reflect states of the world directly. Therefore the total system isn’t directly a maximizer. The question I’m trying to pose is whether it nonetheless acts like a maximizer under particular conditions of training and RL construction.
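To make that concrete, here is a minimal actor-critic sketch (a DDPG-style deterministic update; every name, size, and the fake batch are my own assumptions, not any particular agent's code). The thing to notice is that the policy's gradient only ever sees the critic's scalar output, never the world itself.

```python
import torch
import torch.nn as nn

# Minimal actor-critic sketch (all sizes and names are illustrative assumptions).
obs_dim, act_dim = 4, 2
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

obs = torch.randn(64, obs_dim)  # a batch of observations (fake data for the sketch)

# The actor's loss is the negative of the critic's output: the policy is
# trained to maximize the critic network's *estimate* of value. Whether that
# estimate tracks actual world states is exactly the open question; the
# policy gradient only ever flows through the critic's scalar.
action = torch.tanh(actor(obs))
actor_loss = -critic(torch.cat([obs, action], dim=-1)).mean()

opt.zero_grad()
actor_loss.backward()
opt.step()
```

So the whole system is a critic-output maximizer by construction; whether it behaves like a world-state maximizer depends on how faithfully the critic's estimate tracks the world.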
While you’re technically correct that an NN is a mathematical function, it seems fair to say it isn’t explicit in the sense that matters here, since we can’t read or interpret it very well.