Consider also evolution. Evolution can be regarded as a sort of reinforcement learning algorithm. So why, during billions of years of evolution, was no gene sequence ever created that somehow destroyed all life on Earth? It seems hard to come up with an answer other than “it’s hard to cause a lot of destruction”.
Some speculation:
I think that we have a sequence of reinforcement learning algorithms: evolution → humanity → individual human / small group (maybe followed by → AGI) s.t. each step inherits the knowledge generated by the previous step and also applies more optimization pressure than the previous step. This suggests formulating a “favorability” assumption of the following form: there is a (possibly infinite) sequence of reinforcement learning algorithms $A_0, A_1, A_2, \dots$ s.t. each algorithm is more powerful than the previous one (e.g. has more computing power), and our environment has to be s.t.
(1) Running $A_0$ has a small rate (at most $\epsilon_0$) of falling into traps.
(2) If we run $A_0$ for some time $T_0$ (s.t. $\epsilon_0 T_0 \ll 1$), and then run $A_1$ after updating on the observations made during $T_0$, then $A_1$ has a small rate (at most $\epsilon_1$) of falling into traps.
(3) Ditto when we add $A_2$, and so forth.
The sequence $\{A_i\}$ may be thought of as a sequence of agents, or just as steps in the exploration of the environment by a single agent. So, our condition is that each new “layer of reality” can be explored safely given that the previous layers have already been studied.
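To make the shape of condition (2) concrete, here is one hedged way it might be written; the trap-event wording and the history $h_{T_0}$ are my own shorthand, a sketch of the intended inequality rather than a definition fixed by the discussion:

$$\Pr_{\mu \bowtie A_0}\big[\text{trap before time } T_0\big] \;\lesssim\; \epsilon_0 T_0 \;\ll\; 1, \qquad \Pr_{\mu \bowtie A_1(h_{T_0})}\big[\text{trap per unit time}\big] \;\le\; \epsilon_1,$$

where $A_1(h_{T_0})$ denotes $A_1$ run after updating on the history $h_{T_0}$ generated by $A_0$.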
Most species have gone extinct in the past. I would not be satisfied with an outcome where all humans die or 99% of humans die, even though technically humans might rebuild if there are any left and other intelligent life can evolve if humanity is extinct. These extinction levels can happen with foreseeable tech. Additionally, avoiding nuclear war requires continual cognitive effort to be put into the problem; it would be insufficient to use trial-and-error to avoid nuclear war.
I don’t see why you would want a long sequence of reinforcement learning algorithms. At some point the algorithms produce things that can think, and then they should use their thinking to steer the future rather than trial-and-error alone. I don’t think RL algorithms would get the right answer on CFCs or nuclear war prevention.
I am pretty sure that we can’t fully explore our current level, e.g. that would include starting nuclear wars to test theories about nuclear deterrence and nuclear winter.
I really think that you are taking the RL analogy too far here; decision-making systems involving humans have some things in common with RL but RL theory only describes a fragment of the reasoning that these systems do.
I don’t think you’re interpreting what I’m saying correctly.
First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.
Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it.
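To make the posterior sampling point concrete, here is a minimal self-contained sketch; the hypotheses, reward numbers, and the omitted Bayesian update are illustrative choices of mine, not any particular PSRL implementation:

```python
import random

# Minimal sketch of posterior sampling over a finite hypothesis class.
# Each hypothesis specifies the expected reward of each of three actions;
# the numbers are made up purely for illustration.
hypotheses = {
    "h1": [1.0, 0.2, -10.0],  # action 0 is best; action 2 is a "trap"
    "h2": [0.3, 1.0, -10.0],  # action 1 is best; action 2 is still a trap
}
posterior = {"h1": 0.5, "h2": 0.5}
true_rewards = hypotheses["h1"]  # suppose h1 happens to be the true environment

def sample_hypothesis():
    r, acc = random.random(), 0.0
    for name, p in posterior.items():
        acc += p
        if r <= acc:
            return name
    return name

for epoch in range(100):
    h = sample_hypothesis()
    # Run the optimal policy for the sampled hypothesis (here: a one-step argmax).
    action = max(range(3), key=lambda a: hypotheses[h][a])
    assert action != 2  # the trap is bad under every hypothesis, so it is never tried
    reward = true_rewards[action] + random.gauss(0, 0.1)
    # (A real implementation would now do a Bayesian update of `posterior`;
    # the point here is only that the belief state already "knows" action 2 is bad.)
```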
Third, “starting nuclear wars to test theories” is the opposite of what I’m trying to describe. What I’m saying is, we already have enough knowledge (acquired by exploring previous levels) to know that nuclear war is a bad idea, so exploring this level will not involve starting nuclear wars. What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.
First, when I say “reinforcement learning” I don’t necessarily mean the type of RL algorithms that exist today. I just mean something that is designed to perform well (in some sense) in the face of uncertainty about the environment.
That is broad enough to include Bayesianism. I think you are imagining a narrower class of algorithms that can achieve some property like asymptotic optimality. Agree that this narrower class is much broader than current RL, though.
Second, even existing RL algorithms are not pure trial-and-error. For example, posterior sampling maintains a belief state about the environment and runs the optimal policy for some environment sampled from the belief state. So, if the belief state “knows” that something is a bad/good idea then the algorithm doesn’t need to actually try it.
I agree that if it knows for sure that it isn’t in some environment then it doesn’t need to test anything to perform well in that environment. But what if there is a 5% chance that the environment is such that nuclear war is good (e.g. because it eliminates other forms of destructive technology for a long time)? Then this AI would start nuclear war with 5% probability per learning epoch. This is not pure trial-and-error but it is trial-and-error in an important relevant sense.
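To spell out the arithmetic behind “trial-and-error in an important relevant sense” (treating the 5% draw as roughly independent across epochs until the hypothesis is refuted, which is my simplifying assumption):

$$\Pr[\text{war started within } N \text{ epochs}] \;=\; 1 - 0.95^{N} \;\approx\; 0.64 \quad \text{for } N = 20.$$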
What I’m trying to formalize is, what kind of environments allow this to happen consistently, i.e. being able to acquire enough knowledge to deal with a trap before you arrive at the trap.
This seems like an interesting research approach and I don’t object to it. I would object to thinking that algorithms that only handle this class of environments are safe to run in our world (which I expect is not of this form). To be clear, while I expect that a Bayesian-ish agent has a good chance to avoid very bad outcomes using the knowledge it has, I don’t think anything that attains asymptotic optimality will be useful while avoiding very bad outcomes with decent probability.
After thinking some more, maybe the following is a natural way towards formalizing the optimism condition.
Let $\mathcal{H}$ be the space of hypotheses and $\xi_0 \in \Delta\mathcal{H}$ be the “unbiased” universal prior. Given any $\zeta \in \Delta\mathcal{H}$, we denote $\hat{\zeta} = \mathbb{E}_{\mu \sim \zeta}[\mu]$, i.e. the environment resulting from mixing the environments in the belief state $\zeta$. Given an environment $\mu$, let $\pi^{\mu}$ be the Bayes-optimal policy for $\mu$ and $\pi^{\mu}_{\theta}$ the perturbed Bayes-optimal policy for $\mu$, where $\theta$ is a perturbation parameter. Here, “perturbed” probably means something like softmax expected utility, but more thought is needed. Then, the “optimistic” prior $\xi$ is defined as a solution to the following fixed point equation:

$$\xi(\mu) \;=\; Z^{-1}\,\xi_0(\mu)\,\exp\!\Big(\beta\big(\mathbb{E}_{\mu \bowtie \pi^{\hat{\xi}}_{\theta}}[U] - \mathbb{E}_{\mu \bowtie \pi^{\mu}}[U]\big)\Big)$$

Here, $Z$ is a normalization constant and $\beta$ is an additional parameter.
This equation defines something like a softmax Nash equilibrium in a cooperative game of two players, where one player chooses $\mu$ (so that $\xi$ is eir mixed strategy), the other player chooses $\pi$, and the common utility is minus the regret (alternatively, we might want to choose only Pareto efficient Nash equilibria). The parameter $\beta$ controls optimism regarding the ability to learn the environment, whereas the parameter $\theta$ represents optimism regarding the presence of slack: the ability to learn despite making some errors or doing some random exploration (how to choose these parameters is another question).
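For intuition, here is a minimal numerical sketch of that fixed-point equation for a finite hypothesis class and a finite policy set, with expected utilities given as a table; the table values, the softmax temperature used for the perturbation, and the damped iteration are my own illustrative choices rather than part of the proposal:

```python
import numpy as np

# U[i, j] = E_{mu_i ⋈ pi_j}[U]; made-up numbers for illustration.
U = np.array([[1.0, 0.2, 0.5],
              [0.1, 0.9, 0.6],
              [0.4, 0.3, 0.8]])
xi0 = np.ones(3) / 3           # "unbiased" prior xi_0 over the three hypotheses
beta, theta = 5.0, 1.0         # optimism parameter and perturbation parameter

def softmax(x, temp):
    z = np.exp((x - x.max()) / temp)
    return z / z.sum()

xi = xi0.copy()
for _ in range(200):
    mixture_values = xi @ U                    # value of each policy under the mixture hat(xi)
    pi_theta = softmax(mixture_values, theta)  # perturbed Bayes-optimal policy for hat(xi)
    achieved = U @ pi_theta                    # E_{mu ⋈ pi_theta^hat(xi)}[U] for each mu
    best = U.max(axis=1)                       # E_{mu ⋈ pi^mu}[U]: Bayes-optimal value per mu
    new_xi = xi0 * np.exp(beta * (achieved - best))
    new_xi /= new_xi.sum()                     # the Z^{-1} normalization
    xi = 0.5 * xi + 0.5 * new_xi               # damping to help the iteration settle
# xi now up-weights hypotheses on which a single shared policy loses little value.
```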
Possibly, the idea of exploring the environment “layer by layer” can be recovered from combining this with hierarchy assumptions.
This seems like a hack. The equilibrium policy is going to assume that the environment is good to it in general in a magical fashion, rather than assuming the environment is good to it in the specific ways we should expect given our own knowledge of how the environment works. It’s kind of like assuming “things magically end up lower than you expected on priors” instead of having a theory of gravity.
I think there is something like a theory of gravity here. The things I would note about our universe that make it possible to avoid a lot of traps include:
Physical laws are symmetric across spacetime.
Physical laws are spatially local.
The predictable effects of a local action are typically local; most effects “dissipate” after a while (e.g. into heat). The butterfly effect is evidence for this rather than against this, since it means many effects are unpredictable and so can be modeled thermodynamically.
When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works.
Some “partially-dissipated” effects are statistical in nature. For example, an earthquake hitting an area has many immediate effects, but over the long term the important effects are things like “this much local productive activity was disrupted”, “this much local human health was lost”, etc.
You have the genes that you do because evolution, which is similar to a reinforcement learning algorithm, believed that these genes would cause you to survive and reproduce. If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent’s creator should affect the agent’s beliefs.
If there are many copies of an agent, and successful agents are able to repurpose the resources of unsuccessful ones, then different copies can try different strategies; some will fail but the successful ones can then repurpose their resources. (Evolution can be seen as a special case of this)
Some phenomena have a “fractal” nature, where a small thing behaves similarly to a big thing. For example, there are a lot of similarities between the dynamics of a nation and the dynamics of a city. Thus small things can be used as models of big things.
If your interests are aligned with those of agents in your local vicinity, then they will mostly try to help you. (This applies to parents making their children’s environment safe)
I don’t have an elegant theory yet but these observations seem like a reasonable starting point for forming one.
I think that we should expect evolution to give us a prior that is a good lossy compression of actual physics (where “actual physics” means those patterns the universe has that can be described within our computational complexity bounds). Meaning that, on the one hand, it should have low description complexity (otherwise it would be hard for evolution to find it), and on the other hand, it should assign high probability to the true environment (in other words, the KL divergence of the true environment from the prior should be small). And it should also be approximately learnable, otherwise it won’t go from assigning high probability to actually performing well.
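One hedged way to write this trade-off down (the particular functional form and the weight $\lambda$ are my additions, not something pinned down by the discussion): evolution selects a prior $\xi$ roughly minimizing

$$\underbrace{K(\xi)}_{\text{description complexity}} \;+\; \lambda\,\underbrace{D_{\mathrm{KL}}\!\big(\mu^{*} \,\|\, \hat{\xi}\big)}_{\text{misfit to the true environment }\mu^{*}}$$

subject to $\xi$ being approximately learnable, with $\hat{\xi} = \mathbb{E}_{\mu \sim \xi}[\mu]$ as above.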
The principles you outlined seem reasonable overall.
Note that the locality/dissipation/multiagent assumptions amount to a special case of “the environment is effectively reversible (from the perspective of the human species as a whole) as long as you don’t apply too much optimization power” (“optimization power” probably translates to divergence from some baseline policy plus maybe computational complexity considerations). Now, as you noted before, actual macroscopic physics is not reversible, but it might still be effectively reversible if you have a reliable long-term source of negentropy (like the sun). Maybe we can also slightly relax them by allowing irreversible changes as long as they are localized and the available space is sufficiently big.
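One possible way to cash out “optimization power as divergence from some baseline policy”, purely as a sketch with notation and threshold $C$ of my own choosing: the environment is treated as effectively reversible for every policy $\pi$ with

$$D_{\mathrm{KL}}\!\big(\mu \bowtie \pi \;\big\|\; \mu \bowtie \pi_{\mathrm{base}}\big) \;\le\; C,$$

where $\pi_{\mathrm{base}}$ is the baseline policy, the divergence is taken over trajectories, and computational complexity terms could be added on top.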
“If we construct AI systems, we will give them code (including a prior) that we expect to cause them to do something useful for us. In general, the agency of an agent’s creator should affect the agent’s beliefs” is essentially what DRL does: it allows transferring our knowledge to the AI without hard-coding it by hand.
“When small changes have big and predictable effects (e.g. in a computer), there is often agentic optimization power towards the creation and maintenance of this system of effects, and in these cases it is possible for at least some agents to understand important things about how the system works” seems like it would allow us to go beyond effective reversibility, but I’m not sure how to formalize it or whether it’s a justified assumption. One way towards formalizing it is: the prior is s.t. studying the approximate communication class of the initial state allows determining the entire environment. But this seems to point at a very broad class of approximately learnable priors without specifying a criterion for how to choose among them.
Another principle that we can try to use is the ubiquity of analytic functions. Analytic functions have the property that knowing the function on a bounded domain allows extrapolating it everywhere. This is different from allowing arbitrary computable functions, which may have “if” clauses, so that studying the function on a bounded domain is never enough to be sure about its behavior outside it. In particular, this line of inquiry seems relatively easy to formalize using continuous MDPs (although we run into the problem that finding the optimal policy is, in general, infeasible). Also, it might have something to do with the effectiveness of neural networks (although the popular ReLU activation function is not analytic).
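The underlying fact is the identity theorem for analytic functions, stated here for the one-dimensional real-analytic case as a reminder; the contrasting piecewise example is my own illustration:

$$f, g \ \text{analytic on a connected open } D \subseteq \mathbb{R}, \quad f|_{(a,b)} = g|_{(a,b)} \ \Longrightarrow\ f = g \ \text{on } D;$$

whereas a computable function with an “if” clause, e.g.

$$h(x) = \begin{cases} x & x \le 10 \\ x + 10^{6} & x > 10, \end{cases}$$

agrees with the identity function on $[0,1]$ but not beyond it.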
Actually, I am including Bayesianism in “reinforcement learning” in the broad sense, although I am also advocating for some form of asymptotic optimality (importantly, it is not asymptotic in time, as is often done in the literature, but asymptotic in the time discount parameter; otherwise you give up on most of the utility, as you pointed out in an earlier discussion we had).
In the scenario you describe, the agent will presumably discard (or strongly penalize the probability of) the pro-nuclear-war hypothesis first, since the initial policy loses value much faster on this hypothesis than on the anti-nuclear-war hypothesis (because the initial policy is biased towards the more likely anti-nuclear-war hypothesis). It will then remain with the anti-nuclear-war hypothesis and follow the corresponding policy (of not starting a nuclear war). Perhaps this can be formalized as searching for a fixed point of some transformation.