Curiosity Killed the Cat and the Asymptotically Optimal Agent
Here’s a new(ish) paper that I worked on with Marcus Hutter.
If an agent is exploring enough to be guaranteed strong performance in the limit, that much exploration is enough to kill it (if the environment is minimally difficult and dangerous). It’s nothing too surprising, but if you’re making a claim along these lines about exploration being dangerous, and you need something to cite, this might work.
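To make the flavor of that concrete, here's a toy simulation, not the construction from the paper: the exploration rate EPSILON, the action-set size, and the single fatal "trap" action are all made-up assumptions. It just shows that if exploration never dies down and one action is irreversibly fatal, survival probability decays geometrically in the horizon.

```python
import random

# Toy illustration (not the paper's construction): an agent that keeps a
# non-vanishing chance of trying every action will eventually take the
# "trap" action if one action is irreversibly fatal.

N_ACTIONS = 10      # hypothetical action set; action 0 is the trap
EPSILON = 0.05      # exploration rate that never decays
RUNS = 2000

def survives(horizon):
    """Return True if the agent gets through `horizon` steps without being destroyed."""
    for _ in range(horizon):
        if random.random() < EPSILON:
            action = random.randrange(N_ACTIONS)   # explore uniformly at random
        else:
            action = 1                             # exploit a known-safe action
        if action == 0:                            # stepping on the trap ends everything
            return False
    return True

random.seed(0)
for horizon in (100, 1000, 10000):
    rate = sum(survives(horizon) for _ in range(RUNS)) / RUNS
    print(f"horizon={horizon:>6}: survival rate {rate:.3f}")
# Survival probability is (1 - EPSILON / N_ACTIONS) ** horizon, which goes to 0
# as the horizon grows; the only way out is to let exploration die down.
```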
My attitude towards safe exploration is: exploration isn’t safe. Don’t do it. Have a person or some trusted entity do it for you. The paper can also be read as a justification of that view.
Obviously, there are many more details in the paper.
From your paper:
Given this, why is your main conclusion “Perhaps our results suggest we are in need of more theory regarding the ‘parenting’ of artificial agents” instead of “We should use Bayesian optimality instead of asymptotic optimality”?
The simplest version of the parenting idea already includes an agent which is Bayes-optimal. Parenting would just be designed to help out a Bayesian reasoner, since there's not much you can say about how much a Bayesian reasoner will explore or how much it will learn; it all depends on its prior. (Almost all policies are Bayes-optimal with respect to some (universal) prior.) There's still a fundamental trade-off between learning and staying safe, and while the Bayes-optimal agent doesn't pick a point on that trade-off as badly as the asymptotically optimal agent does, that doesn't quite let us say it will pick the right point. As long as we have access to "parents" that might be able to guide an agent toward world-states where this trade-off is less severe, we might as well make use of them.
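To make the "it all depends on its prior" point concrete, here's a minimal two-environment sketch with made-up numbers (the payoffs, the discount factor GAMMA, and the treasure/trap setup are my assumptions, not anything from the paper): the same Bayes-optimal decision rule explores or refuses to explore depending only on how much prior weight it puts on the unknown action being a trap.

```python
# Minimal sketch: a known-safe action pays 0.5 per step forever. An unknown
# action is, with prior probability p, a treasure paying 1 per step forever,
# and otherwise a fatal trap paying 0 forever. Rewards are discounted by GAMMA.

GAMMA = 0.9

def value_stay_safe():
    # Take the safe action forever: 0.5 + 0.5*GAMMA + 0.5*GAMMA**2 + ...
    return 0.5 / (1 - GAMMA)

def value_explore(p):
    # Try the unknown action: with probability p it's a treasure you then take
    # forever; with probability 1 - p it's a fatal trap and all reward is lost.
    return p * 1.0 / (1 - GAMMA)

for p in (0.1, 0.4, 0.6, 0.9):
    choice = "explore" if value_explore(p) > value_stay_safe() else "stay safe"
    print(f"P(treasure)={p:.1f}: explore={value_explore(p):5.2f}, "
          f"safe={value_stay_safe():5.2f} -> {choice}")
# The Bayes-optimal policy here explores iff p > 0.5; change the prior (or make
# the trap survivable) and the answer flips, so the prior is doing all the work.
```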
And I’d say it’s a conclusion, not the main one.
This question is probably stupid, and also kinda generic (it applies to many other papers besides this one), but forgive me for asking it anyway.
So, I’m trying to think through how this kind of result generalizes beyond MDPs. In my own life, I don’t go wandering around an environment looking for piles of cash that got randomly left on the sidewalk. My rewards aren’t random. Instead, I have goals (or more generally, self-knowledge of what I find rewarding), and I have abstract knowledge constraining the ways in which those goals will or won’t happen.
Yes, I do still have to do exploration—try new foods, meet new people, ponder new ideas, etc.—but because of my general prior knowledge about the world, this exploration kinda feels different from the kind of exploration people talk about in MDPs. It’s not really rolling the dice; I generally have a pretty good idea of what to expect, even if it’s still a bit uncertain along some axes.
So, how do you think about the generalizability of these kinds of MDP results?
(I like the paper, by the way!)
Well, nothing in the paper has to do with MDPs! The results are for general computable environments. Does that answer the question?
Hmm, I think I get it. Correct me if I’m wrong.
Your paper is about an agent which can perform well in any possible universe. (That’s the “for all ν in ℳ”). That includes universes where the laws of physics suddenly change tomorrow. But in real life, I know that the laws of physics are not going to change tomorrow. Thus, I can get optimal results without doing the kind of exhaustive exploration that your paper is talking about. Agree or disagree?
Certainly, for the true environment, the optimal policy exists and you could follow it. The only thing I’d say differently is that you’re only pretty sure the laws of physics won’t change tomorrow. But more realistic forms of uncertainty doom us to either forego knowledge (and potentially good policies) or destroy ourselves. If one slowed down science in certain areas for reasons along the lines of the vulnerable world hypothesis, that would be taking the “safe” stance in this trade-off.
Thanks!
A decent intuition might be to think about what exploration looks like in human children. Children under the age of 5 but old enough to move about on their own—so toddlers, not babies or “big kids”—face a lot of dangers in the modern world if they are allowed to run their natural exploration algorithm. Heck, I’m not even sure this is only a modern problem: in addition to needing to be protected from exploring electrical sockets and moving vehicles, toddlers also have to be protected from more traditional dangers they would otherwise definitely check out, like dangerous plants and animals. Of course, since toddlers grow up into capable adult humans, this is some evidence that, even with those protections, they explore enough to become competent enough to function in society.
Obviously there are a lot of caveats to taking this idea too seriously since I’ve ignored issues related to human development, but I think it points in the right direction of something everyday that reflects this result.
The last paragraph of the conclusion (maybe you read it?) is relevant to this.
Planned summary for the Alignment Newsletter:
Planned opinion:
Thanks Rohin!
It seems like the upshot is that even weak optimality is too strong, since it has to try everything once. How does one make even weaker guarantees of good behavior that are useful in proving things, without just defaulting to expected utility maximization?
I don’t think there’s really a good answer. Theorem 4 in Section 6 is my only suggestion here.
After a bit more thought, I’ve found that it’s hard to avoid ending back up with EU maximization—it basically happens as soon as you require that strategies be good not just on the true environment, but on some distribution of environments that reflects what we think we’re designing an agent for (or the agent’s initial state of knowledge about the world). And since this is such an effective tool for penalizing the “just pick the absolute best answer” strategy, it’s hard for me to avoid circling back to it.
Here’s one possible option, though: look for strategies that are too simple to encode the one best answer in the first place. If the absolute best policy has a Kolmogorov complexity of 10^3 bits (achievable in the real world by strategies being complicated, or in the multi-armed bandit case just by having 2^1000 possible actions) and your agent is only allowed to start with 10^2 symbols, this might make things interesting.
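A quick counting check of that intuition (my own back-of-the-envelope sketch; reading “10^2 symbols” as roughly 100 bytes is an assumption): an agent under that description-length budget is one of at most about 2^801 programs, so if the best of the 2^1000 arms is effectively random from its perspective, it can’t have the answer hardcoded except by vanishingly unlikely luck.

```python
from math import log2

# Counting sketch for the "too simple to hardcode the answer" idea.
# With 2**1000 arms and the best arm uniformly random, an agent whose source
# is at most `budget_bits` long is one of at most 2**(budget_bits + 1) - 1
# programs, so the chance its hardcoded guess is the best arm is bounded by
# that count divided by 2**1000.

n_arms_log2 = 1000        # 2**1000 possible actions, as above
budget_bits = 8 * 100     # a 100-byte (~10**2 symbol) agent; this budget is an assumption

programs = 2 ** (budget_bits + 1) - 1
luck_bound_log2 = log2(programs) - n_arms_log2
print(f"P(best arm hardcoded by luck) <= 2**{luck_bound_log2:.0f}")
# Roughly 2**-199: anything such an agent achieves has to come from
# interaction with the environment rather than from memorizing the answer.
```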
Maybe optimality relative to the best performer out of some class of algorithms that doesn’t include “just pick the absolute best answer”? You basically prove that, in environments with traps, anything that would, absent traps, be guaranteed to find the absolute best answer will instead get trapped. So those aren’t actually very good performers.
I just can’t come up with anything too clever, though, because the obvious classes of algorithms, like “polynomial time,” include the ability to just pick the absolute best answer by luck.