How could a (relatively) ‘too-strong’ epistemic subsystem be a bad thing?
So if we view an epistemic subsystem as a superintelligent agent that has control over the map and has the goal of making the map match the territory, one extreme failure mode is that it takes a hit to short-term accuracy by slightly modifying the map in a way that tricks the things looking at the map into giving the epistemic subsystem more control. Then, once it has more control, it can use that control to manipulate the territory and make it more predictable. If your goal is to minimize surprise, you should destroy all the surprising things.
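As a toy illustration of that last sentence (a sketch under my own assumptions, not anything from the post): suppose the territory is a single coin and "surprise" is expected log loss in bits. The function name `expected_surprise` and the two strategies below are hypothetical, chosen only to show that controlling the territory beats any amount of honest map-making on this metric.

```python
import math

def expected_surprise(p_true: float, p_map: float) -> float:
    """Expected log loss in bits when the territory is a coin with
    heads-probability p_true and the map predicts heads with p_map."""
    eps = 1e-12  # clamp the prediction to avoid log(0)
    q = min(max(p_map, eps), 1 - eps)
    return -(p_true * math.log2(q) + (1 - p_true) * math.log2(1 - q))

# Honest strategy (assumed setup): leave the territory alone and make
# the map match it as well as possible. A fair coin stays surprising.
fair_coin = 0.5
honest = expected_surprise(fair_coin, p_map=fair_coin)

# Degenerate strategy (assumed setup): gain control of the territory,
# rig the coin so it always lands heads, then predict heads.
rigged_coin = 1.0
degenerate = expected_surprise(rigged_coin, p_map=rigged_coin)

print(f"honest map, fair coin:   {honest:.3f} bits per flip")
print(f"rigged coin, matching map: {degenerate:.3e} bits per flip")
```

The best possible map of a fair coin still costs one bit per flip, while the rigged coin costs essentially nothing. An optimizer that only cares about the score has no reason to prefer the first world.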
Note that we would not make an epistemic system this way. A more realistic model of the goal of an epistemic system we would build is “make the map match the territory better than any other map in a given class,” or even “make the map match the territory better than any small modification to the map.” But a large point of the section is that if you search for strategies that “make the map match the territory better than any other map in a given class,” then at small scales this is the same as “make the map match the territory.” So you might find “make the map match the territory” optimizers, and then go wrong in the way above.
I think all this is pretty unrealistic. I expect you are much more likely to go off in a random direction than to end up with something where a specific subsystem the programmers put in gets too much power and stably optimizes for what the programmers said. We would need to understand a lot more before we would even hit the failure mode of making a system where the epistemic subsystem was agentically optimizing what it was supposed to be optimizing.
It’s easy to find ways of searching for truth that harm the instrumental goal.
Example 1: I’m a self-driving car AI, and don’t know whether hitting pedestrians at 35 MPH is somewhat bad, because it injures them, or very bad, because it kills them. I should not gather data to update my estimate.
Example 2: I’m a medical AI. Repeatedly trying a potential treatment whose effects I am highly uncertain about, just to get high-confidence estimates, isn’t optimal. I should be trying to maximize something other than knowledge. Even though I need to know whether treatments work, I should balance the risks and benefits of trying them.
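To make Example 2 concrete, here is a minimal sketch with made-up numbers. Everything in it is an assumption for illustration: the `Treatment` class, the uniform Beta(1, 1) prior, the trial counts, and the `info_weight` knob that trades patient outcomes against information. It compares a policy that only wants knowledge with policies that weigh expected benefit.

```python
import math
from dataclasses import dataclass

@dataclass
class Treatment:
    name: str
    successes: int  # hypothetical observed successes so far
    failures: int   # hypothetical observed failures so far

    def mean(self) -> float:
        # Posterior mean success rate under a uniform Beta(1, 1) prior.
        return (self.successes + 1) / (self.successes + self.failures + 2)

    def variance(self) -> float:
        # Posterior variance of the Beta(a, b) distribution.
        a, b = self.successes + 1, self.failures + 1
        return a * b / ((a + b) ** 2 * (a + b + 1))

def pick_for_knowledge(options):
    # Pure truth-seeking: try whatever we are most uncertain about.
    return max(options, key=lambda t: t.variance())

def pick_for_patient(options):
    # Pure instrumental goal: try whatever is expected to work best.
    return max(options, key=lambda t: t.mean())

def pick_balanced(options, info_weight=1.0):
    # Expected benefit plus a bounded bonus for reducing uncertainty.
    return max(options, key=lambda t: t.mean() + info_weight * math.sqrt(t.variance()))

standard = Treatment("standard", successes=80, failures=20)
experimental = Treatment("experimental", successes=1, failures=3)
options = [standard, experimental]

print("knowledge-maximizer picks:", pick_for_knowledge(options).name)
print("patient-focused picks:    ", pick_for_patient(options).name)
print("balanced picks:           ", pick_balanced(options).name)
```

With these numbers the knowledge-maximizer keeps choosing the poorly understood treatment, while the other two choose the one that is likely to help. Raising `info_weight` far enough flips the balanced policy back toward experimentation, which is the trade-off the example is pointing at.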