I can see two ways.
First, the boring one: assign bounded utilities over everything and a very large disutility to violating the constraint, so that a >1% chance of violating the constraint is never worth it.
Second: throw away most of the utilitarian framework and design the agent to work under rules in a limited environment. If the agent ever leaves that environment, it throws an exception and waits for your guidance.
The first is unexploitable because it’s simply a utility maximizer. The second is presumably unexploitable, because we (presumably) designed an exception for every way it could be exploited.
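To make both options concrete, here is a rough Python sketch. Everything named in it is made up for illustration: `U_MAX`, `P_MAX`, `Action`, `RuleAgent`, `OutOfEnvironment`. For the first option I assume ordinary utilities live in [0, `U_MAX`] and that a safe "do nothing" action worth 0 is always available. Under those assumptions, a penalty of `U_MAX * (1 - P_MAX) / P_MAX` (about 99 times `U_MAX` for a 1% threshold) is enough to make any action with more than a 1% chance of violation score below doing nothing. For the second option I assume the "limited environment" is just the set of states that have a rule.

```python
from dataclasses import dataclass

# --- Option 1: bounded utilities plus a large penalty (all values assumed) ---
U_MAX = 1.0    # assumed upper bound on ordinary utility
P_MAX = 0.01   # tolerated probability of violating the constraint
# Smallest penalty for which any action with p_violate > P_MAX scores below 0,
# the value of the assumed always-available safe action.
PENALTY = U_MAX * (1 - P_MAX) / P_MAX


@dataclass(frozen=True)
class Action:
    name: str
    utility: float    # payoff in [0, U_MAX] if the constraint holds
    p_violate: float  # probability that the action violates the constraint


def expected_utility(action: Action) -> float:
    """Expected payoff, charging PENALTY when the constraint is violated."""
    return (1 - action.p_violate) * action.utility - action.p_violate * PENALTY


def choose(actions: list[Action]) -> Action:
    """Plain expected-utility maximization over the available actions."""
    return max(actions, key=expected_utility)


# --- Option 2: rule-following agent that halts outside its environment ---
class OutOfEnvironment(Exception):
    """Raised when the agent sees a state it has no rule for."""


class RuleAgent:
    def __init__(self, rules: dict[str, str]):
        # The keys of `rules` define the limited environment.
        self.rules = dict(rules)

    def act(self, state: str) -> str:
        if state not in self.rules:
            # Do not improvise: stop and hand control back to the operator.
            raise OutOfEnvironment(state)
        return self.rules[state]
```

With this, `choose([Action("wait", 0.0, 0.0), Action("risky", 1.0, 0.02)])` picks `wait`, since the risky action's expected utility is negative. `RuleAgent({"door_closed": "open_door"}).act("on_fire")` raises `OutOfEnvironment` instead of guessing. The sketch also shows where the "presumably" comes in: the second agent is only as safe as the coverage of its rule table and its out-of-environment check.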