The true cost of fences
Most people have fences in their conversation tree. They have topics which they will not talk about, or which they will only talk about under certain conditions. In some cases those policies are public; in others, people have to learn the hard way, i.e. they find out by running into the fence. Having fences is normal and often necessary, but the price you pay for them is greater than most people realise.
The intuitive price you pay for a fence is that you will not hear certain ideas and will not be able to relay certain ideas to others, in exchange for safety, comfort or something else entirely. You might additionally consider the cost of others being angry or disappointed when they run into the wall. What most people do not take into account, however, is that because things others might value lie behind the fence, they will take actions to get around it. The concern isn’t the mere fact of them getting around it (that isn’t avoided by not having a fence, after all), it’s the how. What’s the next best path now that this one is blocked? Someone, in a sincere attempt to comply with your policy, may look for a “legal” way through, taking actions you would not condone. You have created an incentive to do so, and you cannot be surprised that people will now use side paths they would not have used if there were no fence. You have to make sure those side paths are safe. If they are not, it is irresponsible of you to redirect people there.
Here’s a particularly dire example concerning information policy: let us say that there is an AI safety researcher, Aaron, who is unwilling to hear of infohazards, because the tracking and secrecy involved act as a drain of spoons for him. This is understandable, and not in itself a problem. Note that this hypothetical researcher does consider the secrecy of infohazards important. He will not take them on precisely because he would keep them safe: diligently, and stressfully.
Let us now imagine that Aaron is approached by another AI safety researcher called Beth. Beth has reason to believe that Aaron is the best person to consult with regard to a specific idea of hers, but explaining the idea would involve a significant amount of hazardous information. Here’s where things get interesting.
The field in question is AI safety, after all, so Beth believes her research (and by extension this conversation) to be existentially important. Aaron is in the same boat, so he knows this. Beth believes her research to be infohazardous: its spread could accelerate the apocalypse. Aaron does not know whether this assessment is correct in this particular case, but he does believe in infohazards; he knows that it is true in some cases. And so, by locking important insight behind a restriction of “it can’t be infohazardous”, Aaron implicitly creates an incentive for Beth, and any other hypothetical person in a similar position, to downgrade their risk assessment. Maybe not to reveal the whole thing, but bits around the edges: bits they are not happy about sharing unguarded, but which seem maybe, plausibly safe. Beth is by no means being flippant in doing this. She is aware that the risk is catastrophic, but she nonetheless has a strong incentive to share anyway, because she believes that this conversation has a chance of being vital in preventing cataclysm.
In everything but the most contrived scenarios, Beth would be wrong to downgrade her risk assessment. Information of the relevant type is more likely to cause harm than good if made widely accessible. Technical tricks, insights, even broad approaches are largely use-agnostic, and most users are either careless or not aimed at alignment at all. In those scenarios where Beth thinks of her idea as an essentially complete solution to alignment which could be implemented immediately, she probably does not strongly need Aaron’s feedback specifically. In any other case, the cost scales with the reward, and the cost is higher in the current ecosystem. The onus is on Beth to suck it up and turn around at the fence, and still: knowing that this policy exists makes these thought processes take place in a lot of different minds, and not everyone will be sufficiently cautious. Being the sort of person who is clever enough to come up with something interesting (something dangerous) does not reliably track being the sort of person who is correctly calibrated in this way. Someone will fail. Maybe they will even internalise this lower security standard, because it has proved helpful in being heard at all.
The fact that everyone should look where they are going and not fall into sinkholes does not mean that patching up sinkholes isn’t the right thing to do if you want to prevent injuries. What we’re trying to prevent is a whole lot worse than sinkholes, so I do implore you to start patching. Don’t fail yourself, but also make sure that you’re not the sort of agent who makes it easy for others to fail.
Aaron, who sincerely cares about infohazards being kept secret, and who would take great pains to keep them such, has through his policy created an environment in which critical information is shared more recklessly than it would otherwise be. He did so because he erected a fence without considering what those who really care, those who won’t just turn around dejectedly, might do if they fail to be vigilant for a moment. He failed to consider what dangers lurk on the side path he is sending people onto.
A first step is to actually be aware of this predicament and weigh it: “Is avoiding the cost of keeping secrets really worth the amount of secret-sharing I’m encouraging elsewhere?”, rather than just the intuitive default of “Is avoiding the cost of keeping secrets worth not hearing some interesting ideas?” The intuitive cost is not the bulk of the price you are paying, and you need to be aware of that. The second step, if this is still insufficient to change your policy, is to not reveal policies which create harmful incentives. Don’t give a reason why you won’t hear the infohazard, or give a false one. This will feel bad, and it too incurs costs, but they are lesser costs. Whenever you erect a barrier in front of something people desire, you create an incentive to get around the barrier. If you seek to protect this place, it is on you to make sure that your fences are such that the act of circumventing them is not egregiously damaging.
This is massively overgeneralized. There are risks and costs to ignoring/removing fences, which in many cases far outweigh the costs of respecting them.
And, of course, sometimes we’re wrong in that, and it would be beneficial to ignore/remove the fence. This is a case of equilibrium of opposing forces, not of absolutes.
Sorry, I am confused. I agree that there are costs to removing fences, and I do not think that doing so is a good general policy. I do not see how this is weighed against a cost of respecting fences, however (this is outside the scope of the post, but not respecting them is both hard, since the other person can usually just walk away, and something I can only see being justified under extreme circumstances). To my eyes, the post only points out that there is a factor which usually isn’t considered when erecting a fence, and that it should be weighed accurately. The scales weigh heavy in the example case, but that is because its purpose is to illustrate a situation where that hidden factor matters significantly. Maybe the fervour on that front bled into the general argument somewhat, but while I think that it is difficult to justify this specific public fence, that is not remotely true of all fences everywhere. If it were that simple, if I believed this to be absolute, I would not call to weigh the costs; I would call to stop.