I feel like I might be missing the big picture of what you’re saying here.
Why do you focus on episodic RL? Is it because you don’t want the AI to affect the choice of future test scenarios? If so, isn’t this approach very restrictive since it precludes the AI from considering long-term consequences?
A central point which you don’t seem to address here is where to get the “reward” signal. Unless you mean it’s always generated directly by a human operator? But such an approach seems very vulnerable to perverse incentives (AI manipulating humans / taking control of reward button). I think that it should be solved by some variant of IRL but you don’t discuss this.
Finally, a technical nitpick: It’s highly uncertain whether there is such a thing as a “good SAT solver” since there is no way to generate training data. More precisely, we know that there are optimal estimators for SAT with advice and there are no Γlog-optimal estimators without advice, but we don’t know whether there are Γ0-optimal estimators (see Discussion section here).
EDIT: Actually we know there is no Γ0-optimal estimator. It might still be that there are optimal estimators for somewhat more special distributions which are still “morally generic” in a sense.
I focus on episodic RL because you can get lots of training examples. So you can use statistical learning theory to get nice bounds on performance. With long-term planning you can’t get guarantees through statistical learning theory alone (there are not nearly enough data points for long-term plans working or not working), you need some other approach (outside the paradigm of current machine learning).
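As a rough illustration of the kind of guarantee available in the episodic setting (a standard Hoeffding bound): if a fixed policy’s per-episode return is i.i.d. and bounded in $[0,1]$, then after $n$ episodes the empirical average return $\hat{R}_n$ satisfies, with probability at least $1-\delta$,

$$\left|\hat{R}_n-\mathbb{E}[R]\right|\;\le\;\sqrt{\frac{\ln(2/\delta)}{2n}}.$$

A single long-term plan contributes only one such data point, so no comparable concentration bound is available for it.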
I’m imagining something like ALBA or human-imitation. IRL is one of those capabilities I’m not comfortable assuming we’ll have (I agree with this post).
Hmm, it seems like the following procedure should work: Say we have an infinite list of SAT problems, a finite list of experts who try to solve SAT problems, and we want to get low regret (don’t solve many fewer problems than the best expert does). This is basically a bandit problem: we treat the experts as the arms, and interpret “choosing slot machine x” as “use expert x to try to solve the current SAT problem”. So we can apply adversarial bandit algorithms to do well on this task asymptotically. I realize this is a simple model class, but it seems likely that a training procedure like this would generalize to e.g. neural nets. (I admit I haven’t read your paper yet).
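A minimal sketch of this setup using the EXP3 adversarial bandit algorithm; the expert callables are placeholders for whatever SAT solvers one has, assumed to return 1 if they find a satisfying assignment within some budget and 0 otherwise:

```python
import math
import random

def exp3_over_sat_experts(experts, problems, gamma=0.1):
    """EXP3 over a fixed set of SAT-solving "experts" (arms).

    Each expert is a callable problem -> reward in [0, 1], e.g. 1 if it
    returned a satisfying assignment within budget, 0 otherwise. Only the
    chosen expert's reward is observed on each round, which is what makes
    this a bandit problem rather than full-information prediction.
    """
    k = len(experts)
    weights = [1.0] * k
    total_reward = 0.0
    for problem in problems:
        total_w = sum(weights)
        probs = [(1 - gamma) * w / total_w + gamma / k for w in weights]
        arm = random.choices(range(k), weights=probs)[0]
        reward = experts[arm](problem)        # did the chosen expert solve it?
        total_reward += reward
        estimate = reward / probs[arm]        # importance-weighted reward estimate
        weights[arm] *= math.exp(gamma * estimate / k)
    return total_reward
```

With gamma tuned appropriately, EXP3’s expected regret against the best single expert grows only sublinearly in the number of problems, which is the sense in which the procedure does well asymptotically even against an adversarial problem sequence.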
There are guarantees for non-stationary environments in online learning and multi-armed bandits. I am just now working on how to transfer this to reinforcement learning. Briefly, you use an adversarial multi-armed bandit algorithm where the “arms” are policies, the reward is the undiscounted sum of rewards accumulated during the round, and the probability of keeping the current policy (rather than moving to the next round of the bandit) is the ratio between the values of the time discount function at consecutive moments of time, so that the finite-time undiscounted reward is an unbiased estimate of the infinite-time discounted reward. This means you switch policy roughly once per horizon.
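One way the round structure might look, in my reading of it; `env`, `choose_arm` and `update` are assumed interfaces (a reset-free environment and any adversarial bandit algorithm, e.g. EXP3 as sketched above), and the round rewards would need rescaling to whatever range the bandit algorithm expects:

```python
import random

def policies_as_arms(env, policies, gamma, choose_arm, update, num_rounds):
    """Treat whole policies as bandit arms, switching roughly once per horizon.

    gamma(k) is the discount function. env.observe() and
    env.step(action) -> (observation, reward) are assumed; the environment
    is never reset, so state carries over between rounds.
    """
    obs = env.observe()
    for _ in range(num_rounds):
        arm = choose_arm()                 # pick a policy for this round
        k = 0                              # steps since the round began
        round_reward = 0.0                 # undiscounted sum within the round
        while True:
            obs, reward = env.step(policies[arm](obs))
            round_reward += reward
            k += 1
            # Keep the current policy with probability gamma(k)/gamma(k-1);
            # the j-th reward of the round is then collected with
            # probability gamma(j-1)/gamma(0), so E[round_reward] is
            # proportional to the discounted return of the chosen policy.
            if random.random() >= gamma(k) / gamma(k - 1):
                break
        update(arm, round_reward)          # bandit feedback for this arm
    return obs
```

For geometric discount gamma(k) = beta**k the continuation probability is beta at every step, so rounds last 1/(1 - beta) steps in expectation, i.e. the policy switches roughly once per horizon.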
I agree that defining imperfect rationality is a challenge, but I think it has to be solvable, otherwise I don’t understand what we mean by “human values” at all. I think that including bounded computational resources already goes a significant way towards modeling imperfection.
Of course we can train an algorithm to solve the candid search problem of SAT, i.e. find satisfying assignments when it can. What we can’t (easily) do is train an algorithm to solve the decision problem, i.e. tell us whether a circuit is satisfiable or not. Note that it might be possible to tell that a circuit is satisfiable even when it’s impossible to find a satisfying assignment (e.g. the circuit applies a one-way permutation and checks that the result is equal to a fixed string).
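To make the parenthetical example concrete, with SHA-256 standing in for a one-way permutation (it isn’t a permutation, but it plays the same role for illustration):

```python
import hashlib
import os

secret = os.urandom(16)                        # a witness we immediately forget
target = hashlib.sha256(secret).digest()       # fixed string wired into the circuit

def circuit(x: bytes) -> bool:
    """The "SAT instance": does there exist x with sha256(x) == target?"""
    return hashlib.sha256(x).digest() == target

# Decision problem: we can tell the circuit is satisfiable, because the target
# was built from a known preimage, without ever exhibiting that preimage.
# Search problem: actually producing a satisfying x would mean inverting
# SHA-256, for which no efficient method is known.
```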
It seems like if you explore for the rest of your horizon, then by definition you explore for most of the time you actually care about. That seems bad. Perhaps I’m misunderstanding the proposal.
I agree that it’s solvable; the question is whether it’s any easier to do IRL well than it is to solve the AI alignment problem some other way. That seems unclear to me (it seems like doing IRL really well probably requires doing a lot of cognitive science and moral philosophy).
I agree that this seems hard to do as an episodic RL problem. It seems like we would need additional theoretical insights to know how to do this; we shouldn’t expect AI capabilities research in the current paradigm to automatically deliver this capability.
Re 1st bullet, I’m not entirely certain I understand the nature of your objection.
The agent I describe is asymptotically optimal in the following sense. Let $\gamma(t)$ be the discount function, $U(t)$ the reward obtained by the agent from time $t$ onwards, and $U^{\pi}(t)$ the reward that would be obtained by the agent from time $t$ onwards if it switched to following policy $\pi$ at time $t$. Then for any policy $\pi$,

$$\mathbb{E}_{\tau\sim D(t)}\left[\gamma(\tau)^{-1}\left(U^{\pi}(\tau)-U(\tau)\right)\right]$$

is bounded by something that goes to 0 as $t\to\infty$, for some family of time distributions $D(t)$ which depends on $\gamma$ (for geometric discount, $D(t)$ is uniform from 0 to $t$).
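Spelled out for that geometric case, with $\gamma(t)=\beta^{t}$ and $D(t)$ uniform on $\{0,\dots,t\}$, the condition becomes

$$\frac{1}{t+1}\sum_{\tau=0}^{t}\beta^{-\tau}\left(U^{\pi}(\tau)-U(\tau)\right)\;\le\;\epsilon(t),\qquad \epsilon(t)\xrightarrow[t\to\infty]{}0,$$

where $\epsilon(t)$ is the vanishing bound.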
It’s true that this desideratum seems much too weak for FAI, since the agent would take much too long to learn. Instead, we want the agent to perform well already in the first horizon. This indeed requires a more sophisticated model.
This model can be considered analogous to episodic RL where horizons replace episodes. However, one difference of principle is that the agent retains information about the state of the environment when a new episode begins. I think this difference is a genuine advantage over “pure” episodic learning.
It seems like we’re mostly on the same page with this proposal. Probably part of what’s going on is that the notion of “episodic RL” in my head is quite broad, to the point where it includes things like taking into account an ever-expanding history (each episode is “do the right thing in the next round, given the history”). But at that point it’s probably better to use a different formalism, such as the one you describe.
My objection was the one you acknowledge: “this desideratum seems much too weak for FAI”.