How does that cash out if not in terms of picking a random agent, or random circumstances in the universe? So, remember, the moral value of the universe according to my ethical system depends on P(I’ll be satisfied | I’m some creature in this universe).
There must be some reasonable way to calculate this. And one that doesn’t rely on impossibly taking a uniform sample from a set that has no uniform distribution. Now, we haven’t fully formalized reasoning and priors yet. But there is some reasonable prior probability distribution over situations you could end up in. And after that you can just do a Bayesian update on the evidence “I’m in universe x”.
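Here’s a rough sketch, in code, of the shape of calculation I have in mind. The situations and all of the numbers are made up purely for illustration; the point is only that, given some prior over situations conditioned on being in universe x, the rest is the law of total probability:

```python
# Toy sketch of P(I'm satisfied | I'm in universe x).  The situations,
# their prior weights, and the satisfaction probabilities are invented
# purely for illustration.

# Prior over situations an agent could end up in, already conditioned
# on the evidence "I'm in universe x" (weights sum to 1).
situations = {
    "happy_region":   {"prior": 0.7, "p_satisfied": 0.9},
    "unhappy_region": {"prior": 0.3, "p_satisfied": 0.1},
}

# Law of total probability: sum over situations of
# P(situation | universe x) * P(satisfied | situation).
p_satisfied = sum(s["prior"] * s["p_satisfied"] for s in situations.values())
print(p_satisfied)  # roughly 0.66
```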
I mean, imagine you had some superintelligent AI that takes evidence and outputs probability distributions. And you provide the AI with evidence about what the universe it’s in is like, without letting it know anything about the specific circumstances it will end up in. There must be some reasonable probability for the AI to assign to outcomes. If there isn’t, then that means whatever probabilistic reasoning system the AI uses must be incomplete.
It really should seem unreasonable to suppose that in the 99.9% universe there’s a 99.9% chance that you’ll end up happy! Because the 99.9% universe is also the 0.1% universe, just looked at differently. If your intuition says we should prefer one to the other, your intuition hasn’t fully grasped the fact that you can’t sample uniformly at random from an infinite population.
I’m surprised you said this, and I’m interested in why. Could you explain what probability you would assign to being happy in that universe?
I mean, conditioning on being in that universe, I’m really not sure what else I would do. I know that I’ll end up with my happiness determined by some AI with a pseudorandom number generator. And I have no idea what the internal state of the random number generator will be. In Bayesian probability theory, the standard way to deal with this is to take a maximum entropy (i.e. uniform in this case) distribution over the possible states. And such a distribution would imply that I’d be happy with probability 99.9%. So that’s how I would reason about my probability of happiness using conventional probability theory.
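To make that concrete, here’s a toy version of the calculation. The 1000-state generator is a stand-in I made up; the only point is that a uniform distribution over the unknown internal state reproduces the 99.9% figure:

```python
# Toy version of the maximum-entropy reasoning above.  Suppose (a
# made-up number) the generator has 1000 possible internal states, and
# 999 of them lead the AI to output "happy".
n_states = 1000
happy_states = 999

# Maximum-entropy (uniform) distribution over the unknown state:
p_happy = happy_states / n_states
print(p_happy)  # 0.999
```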
Further further further, let me propose another hypothetical scenario in which an AI generates random people. This time, there’s no PRNG, it just has a counter, counting up from 1. And what it does is to make 1 happy person, then 1 unhappy person, then 2 happy people, then 6 unhappy people, then 24 happy people, then 120 unhappy people, …, then n! (un)happy people, then … . How do you propose to evaluate the typical happiness of a person in this universe? Your original proposal (it still seems to me) is to pick one of these people at random, which you can’t do. Picking a state at random seems like it means picking a random positive integer, which again you can’t do. If you suppose that the state is held in some infinitely-wide binary thing, you can choose all its bits at random, but then with probability 1 that doesn’t actually give you a finite integer value and there is no meaningful way to tell which is the first 0!+1!+...+n! value it’s less than. How does your system evaluate this universe?
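Here’s a quick sketch, purely to make the difficulty vivid, of the running fraction of happy people as the blocks are generated. It swings ever closer to 1 after happy blocks and ever closer to 0 after unhappy blocks, so there is no limiting frequency to appeal to:

```python
from math import factorial

# Running fraction of happy people in the counter universe: block n has
# n! people, happy when n is even and unhappy when n is odd.
happy = total = 0
for n in range(15):
    block = factorial(n)
    if n % 2 == 0:       # happy block
        happy += block
    total += block
    print(n, round(happy / total, 3))
# The printed fractions oscillate (e.g. ~0.91 after the n=10 happy block,
# ~0.08 after the n=11 unhappy block) instead of converging.
```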
I’m not entirely sure how my system would evaluate this universe, but that’s due to my own uncertainty about what specific prior to use and its implications.
But I’ll take a stab at it. I see the counter alternates through periods of making happy people and periods of making unhappy people. I have no idea which period I’d end up being in, so I think I’d use the principle of indifference to assign probability 0.5 to both. If I’m in the happy period, then I’d end up happy, and if I’m in the unhappy period, I’d end up unhappy. So I’d assign probability approximately 0.5 to ending up happy.
Further further, your prescription in this case is very much not the same as the general prescription you stated earlier. You said that we should consider the possible lives of agents in the universe. But (at least if our AI is producing a genuinely infinite amount of pseudorandomness) its state space is infinite: there are uncountably many states it can be in, but (ex hypothesi) it only ever actually generates countably many people. So with probability 1 the procedure you describe here doesn’t actually produce an inhabitant of the universe in question. You’re replacing a difficult (indeed impossible) question—“how do things go, on average, for a random person in this universe?”—with an easier but different question—“how do things go, on average, for a random person from this much larger uncountable population that I hope resembles the population of this universe?”. Maybe that’s a reasonable thing to do, but it is not what your theory as originally stated tells you to do, and I don’t see any obvious reason why someone who accepted your theory as you originally stated it should behave as you’re now telling them they should.
Oh, I had in mind that the internal state of the pseudorandom number generator was finite, and that each pseudorandom number generator was only used finitely-many times. For example, maybe each AI on its world had its own pseudorandom number generator.
And I don’t see how else I could interpret this. I mean, if the pseudorandom number generator is used infinitely many times, then it couldn’t have output “happy” 99.9% of the time and “unhappy” 0.1% of the time. With infinitely many outputs, it would output “happy” infinitely many times and output “unhappy” infinitely many times, and thus the proportion of “happy” to “unhappy” outputs would be undefined.
Returning to my original example, let me repeat a key point: Those two universes, generated by biased coin-flips, are with probability 1 the same universe up to a mere rearrangement of the people in them. If your system tells us we should strongly prefer one to another, it is telling us that there can be two universes, each containing the same infinitely many people, just arranged differently, one of which is much better than the other. Really?
Yep. And I don’t think there’s any way around this. When talking about infinite ethics, we’ve had in mind a canonically infinite universe: one in which, for every level of happiness, suffering, satisfaction, and dissatisfaction, there exist infinitely many agents at that level. It looks like this is the sort of universe we’re stuck in.
So then there’s no difference in the moral value of two canonically-infinite universes except the patterning of value. If you want to compare the moral value of two canonically-infinite universes, there’s just nothing you can do except consider the patterning of values. That is, unless you want to consider any two canonically-infinite universes to be of equal moral value, which doesn’t seem like an intuitively desirable idea.
The problem with some of the other infinite ethical systems I’ve seen is that they would morally recommend redistributing unhappy agents extremely thinly throughout the universe, rather than actually trying to make them happy, provided this was easier. As discussed in my article, my ethical system provides some degree of defense against this, which seems to me like a very important benefit.
I do think JBlack understands the idea of my ethical system and is using it appropriately.
My system provides a method of evaluating the moral value of a specific universe. The point of moral agents is to try to make the universe one that scores highly on this moral valuation. But we don’t know exactly what universe we’re in, so to make decisions, we need to consider all the universes we could be in, and then take the action that maximizes the expected moral value of the universe we’re actually in.
For example, suppose I’m considering pressing a button that will either make everyone very slightly happier, or make everyone extremely unhappy. I don’t actually know which universe I’m in, but I’m 60% sure I’m in the one where pressing it makes everyone happier. Then if I press the button, there’s a 40% chance that the universe ends up with very low moral value, against only a small gain otherwise. That means pressing the button would, in expectation, decrease the moral value of the universe, so my moral system would recommend not pressing it.
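With made-up magnitudes for the two outcomes (only the sign of the expected change matters here), the calculation looks roughly like this:

```python
# Hypothetical magnitudes, chosen only to illustrate the sign of the
# expected change in moral value from pressing the button.
p_good_universe = 0.6   # chance I'm in the universe where the button helps
small_gain = 0.01       # everyone becomes very slightly happier
large_loss = 1.0        # everyone becomes extremely unhappy

expected_change = p_good_universe * small_gain - (1 - p_good_universe) * large_loss
print(expected_change)  # about -0.39: negative, so don't press the button
```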
I think to some extent you may be over-thinking things. I agree that it’s not completely clear how to compute P(“I’m satisfied” | “I’m in this universe”). But to use my moral system, I don’t need a perfect, rigorous solution to this, nor am I trying to propose one.
I think the ethical system provides reasonably straightforward moral recommendations in the situations we could actually be in. I’ll give an example of such a situation that I hope is illuminating. It’s paraphrased from the article.
Suppose you have the ability to create safe AI and are considering whether my moral system recommends doing so. And suppose that if you create safe AI, everyone in your world will be happy, and if you don’t, the world will be destroyed by an evil rogue AI.
Consider an agent that knows it will be in this universe, but nothing else. Well, consider the circumstance, “I’m an agent in an Earth-like world that contains someone who is just like gjm, in a very similar situation, who has the ability to create safe AI”. That description has finite description length, and the agent has no evidence ruling it out. So it must assign some non-zero probability to ending up in such a situation, conditional on being somewhere in this universe.
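As a very rough illustration (this length-weighted toy prior is my own stand-in, not a claim about the exact prior to use), here’s why any finitely-describable circumstance has to get strictly positive weight:

```python
# Toy simplicity prior: weight each finite binary description d in some
# prefix-free code by 2 ** (-len(d)).  These weights sum to at most 1,
# and every finite description gets a strictly positive weight.
def weight(description_bits: str) -> float:
    return 2.0 ** (-len(description_bits))

# A hypothetical 50-bit encoding of the circumstances described above
# still gets positive weight:
print(weight("0" * 50))  # about 8.9e-16: tiny, but non-zero
```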
All the gjms have the same knowledge and values and are in pretty much the same circumstances, so their actions are logically constrained to be the same as yours. Thus, if you decide to create the AI, you are acausally determining the outcome for arbitrary agents in the above circumstances, by making such agents end up satisfied when they otherwise wouldn’t have been. Since an agent in this universe has non-zero probability of ending up in those circumstances, by choosing to make the safe AI you are increasing the moral value of the universe.