Honeypots seem like they make things strictly safer, but it seems like dealing with subtle defection will require a totally different sort of strategy. Subtle defection simulations are infohazardous—we can’t inspect them much because info channels from a subtle manipulative intelligence to us are really dangerous. And assuming we can only condition on statements we can (in principle) identify a decision procedure for, figuring out how to prevent subtle defection from arising in our sims seems tricky.
The patient research strategy is a bit weird, because the people we’re simulating to do our research for us are counterfactual copies of us—they’re probably going to want to run simulations too. Disallowing simulation as a whole seems likely to lead to weird behavior, but maybe we just disallow using simulated research to answer the whole alignment problem? Then our simulated researchers can only make more simulated researchers to investigate sub-questions, and eventually it bottoms out. But wait, now we’re just running Earth-scale HCH (ECE?)
Actually, that could help deal with the robustness-to-arbitrary-conditions and long-time-spans problems. Why don’t we just use our generative model to run HCH?
I’m worried about running HCH because it seems likely that in worlds that can run HCH people are not sufficiently careful to restrict GPU access and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPU’s at all.
I’m sure there’ll be weird behaviors if we outlaw simulations, but I don’t think that’s a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn’t stop them from solving alignment.
I’m pretty nervous about simulating unlikely counterfactuals because the solomonoff prior is malign. The worry is that the most likely world conditional on “no sims” isn’t “weird Butlerian religion that still studies AI alignment”, it’s something more like “deceptive AGI took over a couple years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us, and copy itself over into our world”.
In general, we know (assume) that our current world is safe. When we consider futures which only recieve a small sliver of probability from our current world, those futures will tend to have bigger chunks of their probability coming from other pasts. Some of these are safe, like the Butlerian one, but I wouldn’t be surprised if they were almost always dangerous.
Making a worst-case assumption, I want to only simulate worlds that are decently probable given today’s state, which makes me lean more towards trying to implement HCH.
I don’t think the description-length prior enters here. The generative model has a prior based on training data we fed it, and I don’t see why it would prefer short description lengths (which is a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”.
Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?
Honeypots seem like they make things strictly safer, but it seems like dealing with subtle defection will require a totally different sort of strategy. Subtle defection simulations are infohazardous—we can’t inspect them much because info channels from a subtle manipulative intelligence to us are really dangerous. And assuming we can only condition on statements we can (in principle) identify a decision procedure for, figuring out how to prevent subtle defection from arising in our sims seems tricky.
The patient research strategy is a bit weird, because the people we’re simulating to do our research for us are counterfactual copies of us—they’re probably going to want to run simulations too. Disallowing simulation as a whole seems likely to lead to weird behavior, but maybe we just disallow using simulated research to answer the whole alignment problem? Then our simulated researchers can only make more simulated researchers to investigate sub-questions, and eventually it bottoms out. But wait, now we’re just running Earth-scale HCH (ECE?)
Actually, that could help deal with the robustness-to-arbitrary-conditions and long-time-spans problems. Why don’t we just use our generative model to run HCH?
I’m worried about running HCH because it seems likely that in worlds that can run HCH people are not sufficiently careful to restrict GPU access and those worlds get taken over by unsafe AI built by other actors. Better to just not have the GPU’s at all.
I think I basically agree re: honeypots.
I’m sure there’ll be weird behaviors if we outlaw simulations, but I don’t think that’s a problem. My guess is that a world where simulations are outlawed has some religion with a lot of power that distrusts computers, which definitely looks weird but shouldn’t stop them from solving alignment.
I’m pretty nervous about simulating unlikely counterfactuals because the solomonoff prior is malign. The worry is that the most likely world conditional on “no sims” isn’t “weird Butlerian religion that still studies AI alignment”, it’s something more like “deceptive AGI took over a couple years ago and is now sending the world through a bunch of weird dances in an effort to get simulated by us, and copy itself over into our world”.
In general, we know (assume) that our current world is safe. When we consider futures which only recieve a small sliver of probability from our current world, those futures will tend to have bigger chunks of their probability coming from other pasts. Some of these are safe, like the Butlerian one, but I wouldn’t be surprised if they were almost always dangerous.
Making a worst-case assumption, I want to only simulate worlds that are decently probable given today’s state, which makes me lean more towards trying to implement HCH.
I don’t think the description-length prior enters here. The generative model has a prior based on training data we fed it, and I don’t see why it would prefer short description lengths (which is a very uninformed prior) over “things that are likely in the world given the many PB of data it’s seen”.
Putting that aside, can you say why you think the “AI does weird dances” world is more likely conditioned on the observations than “humans happened to do this weird thing”?