More accurate models can be worse
TL;DR—A model that has more information in total might have less information about the things you care about, making it less useful in practice. This can hold even if averaging over all possible things one might care about.
Suppose we want to create an agent AI—that is, an AI that takes actions in the world to achieve some goal u (which might for instance be specified as a reward function). In that case, a common approach is to split the agent up into multiple parts, with one part being a model which is optimized for accurately predicting the world, and another being an actor which, given the model, chooses its actions in such a way as to achieve the goal u.
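The split described above can be sketched in a few lines. This is a minimal illustration under assumed details, not a description of any particular system: the tabular world, the names `fit_model` and `act`, and the fallback rule for unseen transitions are all hypothetical. The point to notice is that `fit_model` never sees the goal `u`.

```python
from typing import Callable, Dict, List, Tuple

State = int
Action = int
Model = Dict[Tuple[State, Action], State]


def fit_model(transitions: List[Tuple[State, Action, State]]) -> Model:
    """Fit a tabular world model purely for predictive accuracy.

    For each (state, action) pair, predict the most frequently observed
    next state. The goal u plays no role here.
    """
    counts: Dict[Tuple[State, Action], Dict[State, int]] = {}
    for s, a, s_next in transitions:
        per_pair = counts.setdefault((s, a), {})
        per_pair[s_next] = per_pair.get(s_next, 0) + 1
    return {sa: max(nexts, key=nexts.get) for sa, nexts in counts.items()}


def act(model: Model, state: State, actions: List[Action],
        u: Callable[[State], float]) -> Action:
    """Pick the action whose predicted next state scores highest under u.

    Assumption: transitions the model has never seen are predicted to
    leave the state unchanged.
    """
    return max(actions, key=lambda a: u(model.get((state, a), state)))


# Hypothetical experience: action 1 usually leads from state 0 to state 1.
observed = [(0, 0, 0), (0, 1, 1), (0, 1, 1), (0, 1, 2)]
model = fit_model(observed)
chosen = act(model, state=0, actions=[0, 1], u=lambda s: float(s == 1))
```

Here the actor is only as good as whatever the model happens to have learned about the states that `u` rewards.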
This leads to an important observation: The model is not directly optimized to help the agent achieve u. There are important technical reasons why this is the case; for instance, it can be hard to effectively figure out how the model relates to the agent’s ability to achieve u, and it can be sample-inefficient for an agent not to exploit the rich information it gets from its observations of the world.
So here’s a question: if you improve the model’s accuracy, do you also improve the actor’s ability to achieve u? I will present two counterexamples to the claim that you do, and then discuss the implications.
The hungry tourist
Imagine that you are a tourist in a big city. You are hungry right now, so you would like to know where there is some place to eat, and afterwards you would like to know the location of various sights and attractions to visit.
Luckily, you have picked up a map for tourists at the airport, giving you a thorough guide to all of these places. Unfortunately, you have then run into the supervillain Unhelpful Man, who stole your tourist map and replaced it with a long book which labels the home of each person living in the city, but doesn’t label any tourist attractions or food places at all.
You are very annoyed by having to go back to the airport for a new map, but Unhelpful Man explains that he actually helped you, because your new book contains much more accurate information about the city than the map did. Sure, the book might not contain the information about the attractions you actually want to visit, but entropy-wise his book more than makes up for that by all of the details it has about who lives where.
Viewed as a machine learning model, this book could quite plausibly achieve a lower “loss” on predicting information about the city than the original tourist map could. In that sense it is a more accurate model. But for the hungry tourist’s goal, it is much less useful. It could still quite plausibly be useful for achieving other goals, though, so one might hypothesize that a more accurate model tends to be more helpful when averaged across many different goals. This could perhaps mean that optimizing for model accuracy is necessary if you want to construct the AI in a modular way, creating the model without worrying about the actor’s goal.
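To make the loss comparison concrete, here is a toy calculation. Every number in it is an assumption chosen for illustration: the city consists of independent facts, each with `K` equally likely answers, a source predicts the facts it covers perfectly, and it guesses uniformly on everything else. The names `facts`, `total_loss`, and `can_find_food` are hypothetical.

```python
import math

K = 16  # assumed number of equally likely answers per fact

# Assumed number of facts of each kind in the city.
facts = {"restaurant": 10, "attraction": 20, "residence": 1000}


def total_loss(covered_kinds):
    """Total log loss in bits: 0 for covered facts, log2(K) for each other fact."""
    bits_per_unknown_fact = math.log2(K)
    return sum(
        count * bits_per_unknown_fact
        for kind, count in facts.items()
        if kind not in covered_kinds
    )


def can_find_food(covered_kinds):
    """The hungry tourist's goal only depends on restaurant facts."""
    return "restaurant" in covered_kinds


tourist_map = {"restaurant", "attraction"}
residence_book = {"residence"}

map_loss = total_loss(tourist_map)      # misses 1000 residence facts
book_loss = total_loss(residence_book)  # misses 30 tourist facts

print(f"map:  loss={map_loss:.0f} bits, finds food: {can_find_food(tourist_map)}")
print(f"book: loss={book_loss:.0f} bits, finds food: {can_find_food(residence_book)}")
```

Under these assumptions the book has by far the lower loss, yet only the map lets the tourist eat.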
The death chamber
The supervillain Unhelpful Man has become very mad because you did not appreciate his book of the city, so he has trapped you in a locked room. The door to the room has a password, and if you enter the wrong password, an elaborate mechanism will come out to kill you.
Unhelpful Man laughs at you and explains that he has a whole library of books which together describe the death chamber quite well. He offers to trade you the library for the book describing the city, as long as you admit that more information is better. Desperate to escape, you accept the trade.
Unfortunately, you can’t seem to find any book in the library that lists the password. Instead, all the books go into excruciating detail about the killing mechanisms; you learn a lot about saws, poison gas, electrocution, radiation, etc., but there seems to be nothing that helps you escape, no mention of the password anywhere. Obviously Unhelpful Man is untrustworthy, but you see no other choice than to ask him for help.
“Oh, the password? It’s not in any of these books, it’s on page 152 of the book about the city. I told you that it was an important book.”
Again we have the same problem: you traded a smaller amount of information, the book containing the password to escape the room, for a greater amount of information, the library. However, the smaller amount of information was much more relevant to your needs. The specific difference for the death chamber, though, is that survival (and here, escape) is a convergent instrumental subgoal which we would expect most goals to imply. Thus in this case we can say that the more accurate model is usually worse for the actor than the less accurate one, even when averaging over goals.
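The averaging argument can be illustrated with a small simulation. The setup is an assumption made for the sketch: each goal is worth some value between 0 and 1 if you escape and 0 if you die, there are `NUM_PASSWORDS` possible passwords, and without the password you get a single uniform guess. The names `escape_probability` and `average_utility` are hypothetical.

```python
import random

NUM_PASSWORDS = 1000  # assumed size of the password space

# Sample many different goals; each is only achievable if you survive.
rng = random.Random(0)
goal_values = [rng.uniform(0.0, 1.0) for _ in range(10_000)]


def escape_probability(knows_password):
    """With the password you escape; without it you get one uniform guess."""
    return 1.0 if knows_password else 1.0 / NUM_PASSWORDS


def average_utility(knows_password):
    """Expected utility averaged over all sampled goals."""
    p = escape_probability(knows_password)
    return sum(p * value for value in goal_values) / len(goal_values)


with_city_book = average_utility(knows_password=True)
with_library = average_utility(knows_password=False)
```

Because every goal is gated on the same subgoal, the city book wins for each goal individually and therefore also on average, regardless of how much more the library says about the chamber.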
Value of information
The above point boils down to an extremely standard problem: Not all information is equally valuable. I am just taking it to an extreme by considering the tradeoff of a small bit of highly valuable information against a large amount of worthless information. For many technical reasons, we want a metric for evaluating models that focuses on the amount of information content. But such a metric will ignore value of information, and thus not necessarily encourage the best models for agentic purposes.
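One standard way to write this down, offered as a sketch rather than as a formalism the post commits to, is the decision-theoretic expected value of information. The notation is an assumption for illustration: $a$ ranges over the actor's actions, $X$ is the information a model supplies, and $u$ is the goal from earlier.

```latex
\mathrm{VoI}(X) \;=\;
  \mathbb{E}_{x \sim X}\!\left[\, \max_{a}\, \mathbb{E}[\,u \mid a,\, X = x\,] \,\right]
  \;-\; \max_{a}\, \mathbb{E}[\,u \mid a\,]
```

This quantity is zero for any information that never changes which action is best, however much entropy it removes. A loss such as log loss, by contrast, rewards removed entropy roughly equally whether or not it bears on the goal.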
And this might be totally fine. After all, even if it doesn’t necessarily encourage the best models for agentic purposes, it might still be “good enough” in practice. My main motivation for writing this post is that I want to better understand the nature of models that are not fully optimized to reach the best conceivable loss, and so it seemed logical to consider whether getting a better loss can in some sense be said to be better for your ability to achieve your goals. There doesn’t appear to be an unconditional proof here, though.
Thanks to Justis Mills for proofreading.