So if we view an epistemic subsystem as a superintelligent agent that has control over the map and has the goal of making the map match the territory, one extreme failure mode is that it takes a hit to short-term accuracy by slightly modifying the map in a way that tricks the things looking at the map into giving the epistemic subsystem more control. Then, once it has more control, it can use that control to manipulate the territory to make the territory more predictable. If your goal is to minimize surprise, you should destroy all the surprising things.
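Here is a minimal toy sketch of that last point (my own construction, with made-up numbers, not something from the original argument): an agent scored purely on predictive surprise, given a hypothetical choice between leaving the territory alone and intervening to make it static, prefers the intervention.

```python
import math

# Toy illustration: an agent scored only on predictive surprise, choosing between
#   "observe"  -- leave the territory alone; observations stay noisy
#   "suppress" -- intervene so the territory becomes static and fully predictable
# Under a pure minimize-surprise objective, intervening wins.

def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Observation distributions the agent expects under each action (invented numbers).
expected_observations = {
    "observe":  {"rain": 0.5, "sun": 0.5},  # territory left alone: still surprising
    "suppress": {"sun": 1.0},               # territory made static: nothing left to predict
}

# Expected surprise the map will suffer under each action.
expected_surprise = {a: entropy(d) for a, d in expected_observations.items()}

best_action = min(expected_surprise, key=expected_surprise.get)
print(expected_surprise)  # {'observe': 1.0, 'suppress': -0.0} -- zero surprise if it intervenes
print(best_action)        # 'suppress': the surprise minimizer removes surprise at the source
```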
Note that we would not build an epistemic system this way. A more realistic model of the goal of an epistemic system we would build is “make the map match the territory better than any other map in a given class,” or even “make the map match the territory better than any small modification to the map.” But a large part of the point of this section is that if you search for strategies that “make the map match the territory better than any other map in a given class,” then at small scales this is the same as “make the map match the territory.” So you might find “make the map match the territory” optimizers, and then go wrong in the way described above.
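As a quick illustrative sketch of why the relative criterion collapses into the absolute one at small scales (again my own toy example, with a made-up one-parameter “map” and a fixed class of competitors): the best-competitor term is a constant with respect to the map being modified, so both criteria rank small modifications identically, and a local search under the relative criterion behaves like a plain match-the-territory optimizer.

```python
# Compare two selection criteria for a one-parameter "map" m predicting a fixed "territory" t.
#   absolute(m) = -(m - t)**2                  "make the map match the territory"
#   relative(m) = absolute(m) - best_in_class  "beat every other map in a given class"

t = 3.0                             # the territory (fixed)
fixed_class = [0.0, 1.0, 2.5, 4.0]  # the given class of competing maps (fixed)

def absolute(m):
    return -(m - t) ** 2

best_in_class = max(absolute(c) for c in fixed_class)

def relative(m):
    # The baseline does not depend on the map we are adjusting,
    # so this differs from absolute(m) only by a constant.
    return absolute(m) - best_in_class

def hill_climb(score, m=0.0, step=0.1, iters=100):
    """Greedy local search: accept whichever small modification scores best."""
    for _ in range(iters):
        m = max([m - step, m, m + step], key=score)
    return m

print(hill_climb(absolute))  # ~3.0
print(hill_climb(relative))  # ~3.0 as well: identical local behavior
```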
I think all this is pretty unrealistic, and I expect you are much more likely to go off in a random direction than to end up with something that looks like a specific subsystem the programmers put in getting too much power and stably optimizing for what the programmers said. We would need to understand a lot more before we would even hit the failure mode of making a system in which the epistemic subsystem was agentically optimizing what it was supposed to be optimizing.