I’m extremely flattered by the award; I’ve been on LessWrong for like a month, and definitely did not expect this. I can confirm to you guys that this makes me want to try harder at ELK, so your incentive is working!
I want to rebut your arguments in “Strategy: Predict hypothetical sensors” in your Counterexamples to some ELK proposals post. I’m reproducing it in full here for convenience.
Strategy: Predict hypothetical sensors
(Proposal #2 here, also suggested with counterexample by Rohin in private communication)
Instead of installing a single sensor, I could randomly choose what sensors to deploy and how to install them. Then I could train a model Predict+ to map (description of sensor) → (sensor reading), and I could give a human the readings from many hypothetical sensors in order to give them the best chance of detecting tampering.
Worry 1: Predict+ may not be reasonable on non-existent sensors
To collect training data for Predict+, I actually install sensors and then use their readings as a target. In every situation, my AI is being asked to predict the readings from a sensor that I really did build.
If my AI is able to deduce the presence of sensors (as described in the last counterexample), then it’s unclear how Predict+(sensor description) will behave when we give it a description of a sensor that does not in fact exist.
In particular, Predict+(sensor description) might behave in the following undesirable way: it predicts the future observations conditioned on its observations and on the fact that someone typed (sensor description) into the dataset. But it knows that no sensor with description (sensor description) exists. So if someone typed that description into the dataset, that just means something funny is going on—not that the dataset will magically contain readings from a non-existent sensor.
For a predictor which understands the world well it seems quite plausible that this is the most natural behavior. Otherwise it needs to reason about a physically impossible counterfactual to get the “intended” answer (which it never needed to do to answer any other questions). Moreover, it would also be undesirable if the predictor conditioned on such a sensor existing despite having strong evidence to the contrary (since its predictions may then be extremely strange).
(A more extreme version of the same worry applies to proposal #3 here. A predictor seems even more likely to behave badly when it is given a description of a sensor that the human obviously couldn’t have built, and it would definitely know that the sensor doesn’t actually exist.)
It’s true that, if we were to simply describe magic-based sensors, we would likely get weird off-distribution behavior. But to make this more concrete: we could have a separate AI system provide us with step-by-step instructions for how to build a sensor, and then input the execution of these instructions as part of the world state.
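To give that pipeline a concrete shape, here is a minimal sketch of what I have in mind; `SensorSpec`, `predict_plus`, and the string-based world state are all illustrative assumptions of mine, not anything from the ELK report.

```python
# Hypothetical sketch: feed the *execution* of sensor-build instructions into
# the world state, then ask the predictor for that sensor's readout.
from dataclasses import dataclass
from typing import List


@dataclass
class SensorSpec:
    """A concrete, buildable sensor: step-by-step construction instructions
    plus a description of its readout channel."""
    build_instructions: List[str]
    readout_description: str


def query_hypothetical_sensor(predict_plus, world_state: List[str], spec: SensorSpec):
    """Condition the predictor on the sensor physically existing (its build
    steps are part of the world state), not on someone merely typing a
    description into the dataset."""
    augmented_state = world_state + [f"execute: {step}" for step in spec.build_instructions]
    return predict_plus(augmented_state, spec.readout_description)
```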
To answer the “how does this sensor-builder know how to build sensors that sense the thing it claims to sense?” question, I’d posit that this problem is much simpler than full ELK (we can test a given sensor design offline in many specific scenarios).
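A minimal sketch of what I mean by testing offline, assuming we can stage scenarios whose ground truth we already know (all names here are hypothetical placeholders):

```python
def validate_sensor_design(spec, test_scenarios, run_physical_test, tolerance=0.05):
    """Check that a physically built copy of the sensor tracks known ground
    truth across staged test scenarios."""
    for scenario in test_scenarios:
        reading = run_physical_test(spec, scenario)           # actually build and deploy it
        if abs(reading - scenario.ground_truth) > tolerance:  # compare to the known answer
            return False
    return True
```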
To answer the “how do you choose which sensors to tell sensor-builder to build” question, I’d posit that the space of possible sensing modalities is actually finite, if large, and I’d wager we could use AI beforehand to help us figure out reasonable coverage of all possible sensors that measure quantities of interest. Why do I think the space is finite? Maybe I’m thinking too narrowly, but it seems like we know most of the dimensions of variation of measurable, reality-affecting actions:

- detected signal type (passive, or active in response to an output, like echolocation)
- signal amplitude (small-scale or large-scale detection)
- wavelength
- detector density (close together like photoreceptors in the eye or far apart like touch cells, but also density along other dimensions, like wavelength)
- relative orientation
- and perhaps a few others.
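As a toy illustration of why this space looks enumerable to me, coarse coverage along those dimensions might be generated like this (the specific category values are placeholders, not a serious taxonomy):

```python
from itertools import product

signal_types = ["passive", "active"]                     # e.g. camera vs. echolocation
amplitudes   = ["micro", "meso", "macro"]                 # small- vs. large-scale detection
wavelengths  = ["radio", "IR", "visible", "UV", "x-ray"]  # coarse wavelength bands
densities    = ["sparse", "dense"]                        # detector spacing
orientations = ["fixed", "scanning"]                      # relative orientation over time

sensor_space = list(product(signal_types, amplitudes, wavelengths, densities, orientations))
print(len(sensor_space))  # 2 * 3 * 5 * 2 * 2 = 120 coarse sensor archetypes
```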
One possible breaker response is, what about ontology mismatches that occur because the “scary” fact is an enormous conjunction of different facts? This seems addressable via impulse-sensors, i.e. “sensors” that combine an intervention in the system with an observed response. An example of an impulse is “an innocent bystander walks into the diamond room”, and the attached sensor is a video feed of what happens to them (and whether they get vaporized by nanomachines). Or, for a more out-there example, “an omnipotent previously-undetectable alien AI, that had been running at a layer beneath all human semiconductors, activates at timestep 592, scoops up a handful of nanomachines, deposits them on the electron microscope, and then deactivates forever”, followed by the microscope readings. This part is half-baked, and if you can provide a strong Breaker response to this scenario, it’d be helpful for me to noodle on.
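To pin down what I mean by an impulse-sensor, here is a minimal sketch using the bystander example (the structure and field names are just illustrative):

```python
from dataclasses import dataclass


@dataclass
class ImpulseSensor:
    """An intervention on the system paired with the sensor that observes the response."""
    intervention: str  # something done *to* the system
    timestep: int      # when the intervention happens
    sensor: str        # the readout we then ask the predictor to simulate


bystander_probe = ImpulseSensor(
    intervention="an innocent bystander walks into the diamond room",
    timestep=100,
    sensor="video feed of the bystander over the next 60 seconds",
)
```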
I’d expect another possible breaker response here to be: “how does a human understand the array of sensor outputs, which even if small may seem unconnected, well enough to actually understand what they imply together?” In my original proposal, the “terrifier” was an AI that searched over the predictor’s Bayes net for where to deploy sensors such that their readings would maximize the probability that the human would reject the proposal. We can add to that the terrifier’s ability to provide a short (<5 page) description of how to interpret the sensors in conjunction, short enough that a committee of humans could plausibly check its logic and make sure it wasn’t just scaring the humans without cause. For a decent set of sensors and impulses, this covers any scenario I can quickly think of, including “everyone has been copied using nanotech” (which seems like it should be detectable at the nanometer scale somewhere along the line).
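Schematically, the terrifier-plus-explanation loop I’m imagining looks something like this; every callable here is an assumption of mine, and scoring sensors independently is a simplification of searching over the predictor’s Bayes net:

```python
def terrify(candidate_sensors, predict_plus, world_state, human_reject_prob, explain, k=10):
    """Pick the k hypothetical sensors whose predicted readings most raise the
    probability that the human rejects the proposal, then ask for a short
    joint explanation a committee of humans can check."""
    scored = []
    for sensor in candidate_sensors:
        readings = predict_plus(world_state, sensor)
        scored.append((human_reject_prob(readings), sensor))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most alarming first
    chosen = [sensor for _, sensor in scored[:k]]
    explanation = explain(chosen, max_pages=5)           # <5 pages, committee-checkable
    return chosen, explanation
```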
If the breaker wants to claim that there is bad stuff that no small combination of futuristic sensors could detect at any scale, I’d need to hear a specific example.
Worry 2: the human may not be able to tell what is going on even with all the sensors
For some kinds of tampering no possible sensors would reveal what had happened. For example, we can imagine that our AI has covertly built an army of tiny machines which deconstruct the objects in the room while replacing them with fake versions that look the same on the outside. Sufficiently sophisticated machinery could reproduce the results of almost any experiment we perform (e.g. if you broke an egg, the tiny machines could simulate the sound of an egg cracking and quickly create the yolk on demand).
More realistically, our AI could search for sensors and tamper with any that it finds. It’s not clear how it is supposed to answer questions about what a sensor “would have shown”—if the hypothetical sensor had actually existed, then it also would have been found and tampered with. In this scenario, it may be “correct” for Predict+ to report the tampered value.
(As usual, this is only a problem if the human overseer literally can’t notice anything suspicious about a sequence of actions which is in fact constructing an army of sophisticated tiny machines. We are putting the burden of proof on the builder to show that a given concern is impossible.)
I can’t think of a scenario where an army of tiny tampering machines isn’t detectable by some tiny sensor before the machines are all active and have already eaten that sensor, given a sufficiently rich set of possible sensors as previously described. Maybe your mental model of the predictor doesn’t include detailed intermediate states? But it seems like we would need to populate those intermediate states in order to simulate sensor outputs anyway.
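The intermediate states are the crux for me: if the predictor has to model them anyway, we can (schematically) ask for hypothetical sensor readings at every timestep rather than only at the end, so the tampering machinery has to evade detection while it is still being assembled. All names below are illustrative assumptions.

```python
def scan_intermediate_states(predict_plus, world_states, sensors, is_suspicious):
    """Return the first (timestep, sensor) whose predicted reading is flagged
    as suspicious, or None if nothing is flagged."""
    for t, state in enumerate(world_states):
        for sensor in sensors:
            if is_suspicious(predict_plus(state, sensor)):
                return t, sensor
    return None
```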
Based on whatever comment tree develops here, I’ll decide whether I believe in this enough to write it up as a formal submission.