Yes. The problem is not the Hell scenarios; the problem is that we can make them artificially probable via language choice.
I think this shows how the whole “language independent up to a constant” thing is basically just a massive cop-out.
Some results are still true. An exploring agent (if it survives) will converge on the right environment, independent of language. And episodic environments do allow AIXI to converge on optimal behaviour (as long as the discount rate is gradually raised).
An exploring agent (if it survives) will converge on the right environment, independent of language.
But it seems like such an agent could only survive in an environment where it literally can’t die, i.e., where there is nothing it can do that could possibly cause its death, since in order to converge on the right environment, independent of language, it has to try all possible courses of action as time goes to infinity, and eventually it will do something that kills it.
What value (either practical or philosophical, as opposed to purely mathematical), if any, do you see in this result, or in the result about episodic environments?
The main value is that it suggests that an AIXI-like agent that balances exploration and exploitation could be what is needed.
My argument is that “(if it survives) will converge on the right environment, independent of language” is not a property we want in an FAI, because that implies it will try every possible course of action at some point, including actions that, with high probability, kill it or worse (e.g., destroy the universe). Instead, it seems to me that what we need is a standard EU-maximizing agent that just uses a better prior than merely “universal”, so that it explores (and avoids exploring) in ways that we’d think reasonable. Sorry if I didn’t make that fully explicit or clear. If you still think “an AIXI-like agent that balances exploration and exploitation could be what is needed”, can you please elaborate?
We have the universal explorer—it will figure out everything, if it survives, but it’ll almost certainly kill itself.
We have the bad AIXI model above—it will survive for a long time, but is trapped in a bad epistemic state.
What would be ideal would be a way of establishing the minimal required exploration rate.
Do you mean a way of establishing this independent of the prior, i.e., the agent will explore at some minimum rate regardless of what prior we give it? I don’t think that can be right, since the correct amount of exploration must depend on the prior. (By giving AIXI a different bad prior, we can make it explore too much instead of too little.) For example, suppose there are physics theories P1 and P2 that are compatible with all observations so far, and an experiment is proposed to distinguish between them, but the experiment will destroy the universe if P1 is true. Whether or not we should do this experiment must depend on what the correct prior is, right? On the other hand, if we had the correct prior, we wouldn’t need a “minimal required exploration rate”; the agent would just explore/exploit optimally according to the prior.
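To make the dependence on the prior concrete, here is a toy expected-utility calculation in Python (the utilities and the candidate priors are made up purely for illustration, with “destroy the universe” crudely capped at a large negative number):

```python
# Toy illustration with made-up numbers: whether running the experiment is worth it
# depends entirely on the prior probability assigned to P1.

U_DESTROYED = -1e9   # crude stand-in utility for "universe destroyed" (P1 true)
U_LEARNED = 100.0    # utility of learning which theory is correct (P2 true)
U_STATUS_QUO = 0.0   # utility of leaving the question unresolved

def expected_utility_of_experiment(p1_prior):
    """Expected utility of running the experiment under a given prior on P1."""
    return p1_prior * U_DESTROYED + (1.0 - p1_prior) * U_LEARNED

for p1 in (1e-12, 1e-3, 0.5):
    eu_run = expected_utility_of_experiment(p1)
    decision = "run it" if eu_run > U_STATUS_QUO else "don't run it"
    print(f"P(P1) = {p1:g}: EU(experiment) = {eu_run:.3f} vs EU(skip) = {U_STATUS_QUO:.1f} -> {decision}")
```

With a negligible prior on P1 the experiment looks worth running; with any appreciable prior it doesn’t. A different prior flips the decision, which is the sense in which there is no prior-independent “right” amount of exploration here.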
In theory, changing the exploration rate and changing the prior are equivalent. I think that it might be easier to decide upon an exploration rate that gives a good result for generic priors, than to be sure that generic priors have good exploration rates. But this is just an impression.
In theory, changing the exploration rate and changing the prior are equivalent.
Not really. Standard AIXI is completely deterministic, while the usual exploration strategies for reinforcement learning, such as ɛ-greedy and soft-max, are stochastic.
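For reference, here is a minimal sketch of those two strategies, assuming the agent already has an array of estimated action values (the function names and default parameters are just illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                                # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [1.0, 2.5, 0.3]
print(epsilon_greedy(q), softmax_action(q))
```

Both make the chosen action a random variable, which is exactly the ingredient that standard AIXI’s deterministic argmax over expected value lacks.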
By changing the prior, you can make an AIXI agent explore more if it receives one set of inputs and also explore less if it receives another set of inputs. You can’t do this by changing an “exploration rate”, unless you’re using some technical definition where it’s not a scalar number?
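As a loose illustration of that asymmetry (not AIXI itself: just a two-armed Bernoulli bandit with Beta priors, using posterior sampling as a convenient stand-in for posterior-driven choice), how often the non-greedy arm gets tried depends on the observed history, whereas a scalar ɛ gives the same rate regardless of inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_pick_arm1(successes, failures, n_samples=200_000):
    """Estimate how often a posterior-sampling chooser picks arm 1 over arm 0,
    given Beta(1 + s, 1 + f) posteriors on each arm's success probability."""
    draws0 = rng.beta(1 + successes[0], 1 + failures[0], n_samples)
    draws1 = rng.beta(1 + successes[1], 1 + failures[1], n_samples)
    return float(np.mean(draws1 > draws0))

# History A: both arms well observed, arm 0 clearly better -> arm 1 almost never tried.
print(prob_pick_arm1(successes=(80, 20), failures=(20, 80)))  # close to 0

# History B: arm 1 barely observed -> arm 1 tried noticeably more often.
print(prob_pick_arm1(successes=(80, 1), failures=(20, 1)))    # roughly 0.1

# A fixed epsilon-greedy rule would try arm 1 with probability epsilon/2 in both cases.
```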
Given arbitrary computing power and full knowledge of the actual environment, these are equivalent. But, as you point out, in practice they’re going to be different. For us, something simple like an “exploration rate” probably gives a more understandable picture of what the AIXI’s actions will look like.
What value (either practical or philosophical, as opposed to purely mathematical), if any, do you see in this result, or in the result about episodic environments?
There are plenty of applications of reinforcement learning where it is plausible to assume that the environment is ergodic (that is, the agent can’t “die” or fall into traps that permanently result in low rewards) or episodic. The Google DQN Atari game agent, for instance, operates in an episodic environment, so stochastic action selection is acceptable.
Of course, this is not suitable for an AGI operating in an unconstrained physical environment.
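To spell out why episodicity makes stochastic action selection tolerable, here is a toy sketch (the ToyEpisodicEnv class and its interface are invented for illustration and have nothing to do with the actual DQN code): a random action can at worst waste the current episode, because the next reset starts from scratch.

```python
import random

class ToyEpisodicEnv:
    """Invented stand-in environment: each episode lasts at most 10 steps; action 1
    yields reward 1, action 0 ends the episode early with reward 0. Nothing persists
    across episodes, so no single action can do lasting damage."""

    def reset(self):
        self.t = 0
        return self.t

    def sample_action(self):
        return random.choice([0, 1])

    def step(self, action):
        self.t += 1
        done = (action == 0) or (self.t >= 10)
        reward = 1.0 if action == 1 else 0.0
        return self.t, reward, done

def run(env, num_episodes=1000, epsilon=0.05):
    returns = []
    for _ in range(num_episodes):
        obs = env.reset()            # episode boundary: past mistakes are wiped out
        done, total = False, 0.0
        while not done:
            # Random exploration is safe here: at worst it cuts the current episode
            # short; it can never trap the agent for the rest of its existence.
            if random.random() < epsilon:
                action = env.sample_action()
            else:
                action = 1           # "exploit": the known-good action, hard-coded to keep the sketch tiny
            obs, reward, done = env.step(action)
            total += reward
        returns.append(total)
    return returns

print(sum(run(ToyEpisodicEnv())) / 1000)   # average per-episode return
```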
Yes, I agree there can be applications for narrow AI or even limited forms of AGI. I was assuming that Stuart was thinking in terms of FAI, so my question was in that context.