The first intuition pump that comes to mind for distinguishing mechanisms is examining how my brain generates and assigns credence to the hypothesis that something going wrong with my car is a sensor malfunction vs telling me about a problem in the world that the sensor exists to alert me to.
One thing that happens is that the broken sensor implies a much larger space of worlds because it can vary arbitrarily instead of only in tight informational coupling with the underlying physical system. So fluctuations outside the historical behavior of the sensor either implies I’m in some sort of weird environment or that the sensor is varying with something besides what it is supposed to measure, a hidden variable if coherent or noisy if random. So the detection is tied to why it is desirable to goodhart the sensor in the first place, more option value by allowing consistency with a broader range of worlds. By the same token, the hypothesis “the sensor is broken” should be harder to falsify since the hypothesis is consistent with lots of data? The first thing it occurs to me to do is supply a controlled input to see if I get a controlled output (see: calibrating a scale by using a known weight). This suggests that complex sensors that couple with the environment along more dimensions are harder to fool, though any data bottlenecks that are passed through reduce this i.e. the human reviewing things is themselves using a learnable simple routine that exhibits low coupling.
The next intuition pump, imagine there are two mechanics. One makes a lot of money from replacing sensors, they’re fast at it and get the sensors for a discount by buying in bulk. The second mechanic makes a lot of money by doing a lot of really complicated testing and work. They work on fewer cars but the revenue per car is high. Each is unscrupulous and will lie that your problem is the one they are good at fixing. I try to imagine the sorts of things they would tell me to convince me the problem is really the sensor vs the problem is really out in the world. This even suggests a three player game that might generate additional ideas.
The first intuition pump that comes to mind for distinguishing mechanisms is examining how my brain generates and assigns credence to the hypothesis that something going wrong with my car is a sensor malfunction vs telling me about a problem in the world that the sensor exists to alert me to.
One thing that happens is that the broken sensor implies a much larger space of worlds because it can vary arbitrarily instead of only in tight informational coupling with the underlying physical system. So fluctuations outside the historical behavior of the sensor either implies I’m in some sort of weird environment or that the sensor is varying with something besides what it is supposed to measure, a hidden variable if coherent or noisy if random. So the detection is tied to why it is desirable to goodhart the sensor in the first place, more option value by allowing consistency with a broader range of worlds. By the same token, the hypothesis “the sensor is broken” should be harder to falsify since the hypothesis is consistent with lots of data? The first thing it occurs to me to do is supply a controlled input to see if I get a controlled output (see: calibrating a scale by using a known weight). This suggests that complex sensors that couple with the environment along more dimensions are harder to fool, though any data bottlenecks that are passed through reduce this i.e. the human reviewing things is themselves using a learnable simple routine that exhibits low coupling.
The next intuition pump, imagine there are two mechanics. One makes a lot of money from replacing sensors, they’re fast at it and get the sensors for a discount by buying in bulk. The second mechanic makes a lot of money by doing a lot of really complicated testing and work. They work on fewer cars but the revenue per car is high. Each is unscrupulous and will lie that your problem is the one they are good at fixing. I try to imagine the sorts of things they would tell me to convince me the problem is really the sensor vs the problem is really out in the world. This even suggests a three player game that might generate additional ideas.