I’ve made a few posts that seemed to contain potentially valuable ideas related to AI safety. However, I got almost no feedback on them, so I was hoping some people could look at them and tell me what they think. They still seem valid to me, and if they are, they could potentially be very valuable contributions. And if they aren’t valid, then I think knowing the reason for this could potentially help me a lot in my future efforts towards contributing to AI safety.
I made a new article about defining “optimizer”. I was wondering if someone could look over it and tell me what they think before I post it on Less Wrong. You can find it here.
There is a matter I’m confused about: What exactly is base-level reality, does it necessarily exist, and is it ontologically different from other constructs?
First off, I had gotten the impression that there was a base-level reality, and that in some sense it’s ontologically different from the sorts of abstractions we use in our models. I thought that, in some sense, the subatomic particles “actually” existed, whereas our abstractions, like chairs, were “just” abstractions. I’m not actually sure how I got this impression, but I had the sense that other people thought this way, too.
And indeed, you could adopt an epistemology that would imply this. But I’m not sure what the benefit of doing so would be. Suppose people discovered lower-level particles that composed quantum particles, and modeling using these lower-level particles would provide higher predictive accuracy than using mere quantum physics. But then suppose people discover sub-sub-quantum particles and that modeling the world in terms of these sub-sub-particles further yielded a more accurate world model than just modeling with sub-quantum particles. And what if this process continued forever: people just kept finding lower-level particles that composed higher-level particles and had higher predictive accuracy.
In the above situation, what’s supposed to be taken to be base-level reality? Now, if you wanted, you could imagine that the world actually does have a base-level reality in the form of an infinite-memory computer, and that this computer dynamically generates new abstractions and uses them to compute what the agents see, making sure that it manages to start simulating things at a lower level of abstraction before any agent could reach the current “base-level” reality.
But that doesn’t seem like a very natural hypothesis. If you keep finding more and more decompositions forever, it really seems to me that “there’s no base-level reality” would be a simpler and more natural hypothesis.
Distinguishing the physical world from mathematical entities is pragmatic: it reflects how the world relates to you. It’s impossible to fully know what the physical world is, but it’s possible to interact with it (and to care about what happens in it), and these interactions depend on what it is. When reasoning about constructed mathematical entities, you get to know what you are working with, but not in the case of the physical world. So we can similarly consider an agent living in a different mathematical entity, and for that agent that mathematical entity would be their real physical world.
Because we have to deal with the presence of the real world, it might be convenient to develop concepts that don’t presume knowledge of its nature, which should apply to mathematical entities if we forget (in some perspective) what they are. It’s also relevant to recall that the idea of a “mathematical entity” is informal, so strictly speaking it doesn’t make sense to claim that the physical world is “a mathematical entity”, because we can’t carefully say what exactly “a mathematical entity” is in general, there are only more specific examples that we don’t know the physical world to be one of.
Reality is that which actually exists, regardless of how any agents within it might perceive it, choose to model it, or describe it to each other.
If reality happens to be infinitely complex, then all finite models of it must necessarily be incomplete. That might be annoying, but why would you consider that to mean “reality doesn’t really exist”?
Well, to be clear, I didn’t intend to say that reality doesn’t really exist. There’s definitely something that’s real. I was just wondering whether there is some base-level reality that’s ontologically different from other things, like the abstractions we use.
Now, what I’m saying feels pretty philosophical, and perhaps the question isn’t even meaningful.
Still, I’m wondering about the agents making an infinite sequence of decompositions that each have increased predictive accuracy. What would the base-level reality be in that case? Any of the decompositions the agents create would be wrong, even if some are infinitely complex.
Also, I’ve realized I’m confused about the meaning of “what really exists”, but I think it would be hard to clarify and reason about this. Perhaps I’m overthinking things, but I am still rather confused.
If I imagine some other agent or AI that doesn’t distinguish between base-level reality and abstractions, I’m not sure how I could argue with them. I mean, in principle, I think you could come up with reasoning systems that distinguish between base-level reality and abstractions, as well as reasoning systems that don’t, that both make equally good empirical predictions. If there was some alien that didn’t make the distinction in their epistemology or ontology, I’m not sure how I could say, and support saying, “You’re wrong”.
I mean, I predict you could both make arbitrarily powerful agents with high predictive accuracy and high optimization-pressure that don’t distinguish between base-level reality and abstractions, and could do the same with agents that do make such a distinction. If both perform fine, then I’m not sure how I could argue that one’s “wrong”.
Is the existence of base-level reality subjective? Does this question even make sense?
We are probably just using different terminology and talking past each other. You agree that there is “something that’s real”. From my point of view, the term “base-level reality” refers to exactly that which is real, and no more. The abstractions we use do not necessarily correspond with base-level reality in any way at all. In particular, if we are simulated entities, dreaming, fully hallucinating, disembodied consciousnesses, or brains in jars with synthetic sensory input, then we may not have any way to learn anything meaningful about base-level reality, but that does not preclude its existence because it is still certain that something exists.
Still, I’m wondering about the agents making an infinite sequence of decompositions that each have increased predictive accuracy. What would the base-level reality be in that case? Any of the decompositions the agents create would be wrong, even if some are infinitely complex.
None of the models are any sort of reality at all. At best, they are predictors of some sort of sensory reality (which may be base-level reality, or might not). It is possible that all of the models are actually completely wrong, as the agents have all been living in a simulation or are actually insane with false memories of correct predictions, etc.
Is the existence of base-level reality subjective? Does this question even make sense?
The question makes sense, but the answer is the most emphatic NO that it is possible to give. Even in some hypothetical solipsistic universe in which only one bodiless mind exists and anything else is just internal experiences of that mind, that mind objectively exists.
It is conceivable to suppose a universe in which everything is a simulation in some lower-level universe resulting in an ordering with no least element to qualify as base-level reality, but this is still an objective fact about such a universe.
We do seem to have been talking past each other to some extent. Base-level reality, of course, exists if you define it to be “what really exists”.
However, I’m a little unsure whether that’s how people use the word. I mean, if someone asked me if Santa really exists, I’d say “No”, but if they asked if chairs really existed, I’d say “Yes”. That doesn’t seem wrong to me, but I thought our base-level reality only contained subatomic particles, not chairs. Does this mean the statement “Chairs really exist” is actually wrong? Or am I misinterpreting?
I’m also wondering how people justify thinking that models talking about things like chairs, trees, and anything other than subatomic particles don’t “really exist”. Is this even true?
I’m just imagining talking with some aliens with no distinction between base-level reality and what we would consider mere abstractions. For example, suppose the aliens knew about chairs, and when they discovered quantum theory, they said, “Oh! There are these atom things, and when they’re arranged in the right way, they cause chairs to exist!” But suppose they never distinguished between the subatomic particles being real and the chairs being real: they just saw both subatomic particles and chairs as fully real, and the correct arrangement of the former caused the latter to exist.
How could I argue with such aliens? They’re already making correct predictions, so I don’t see any way to show them evidence that disproves them. Is there some abstract reason to think models about things like chairs don’t “really exist”?
The main places I’ve seen the term “base-level reality” used are in discussions about the simulation hypothesis. “Base-level” being the actually real reality where sensory information tells you about interactions in the actual real world, as opposed to simulations where the sensory information is fabricated and almost completely independent of the rules that base-level reality follows. The abstraction is that the base-level reality serves as a foundation on which (potentially) a whole “tower” of simulations-within-simulations-within-simulations could be erected.
That semantic excursion aside, you don’t need to go to aliens to find beings that hold subatomic particles as being ontologically equivalent with chairs. Plenty of people hold that they’re both abstractions that help us deal with the world we live in, just at different length scales (and I’m one of them).
Well, even in a simulation, sensory information still tells you about interactions in the actual real world. I mean, based on your experiences in the simulation, you can potentially approximately infer the algorithm and computational state of the “base-level” computer you’re running in, and I believe those count as interactions in the “actual real world”. And if your simulation is sufficiently big and takes up a sufficiently large amount of the world, you could potentially learn quite a lot about the underlying “real world” just by examining your simulation.
That said, I still can’t say I really understand the concept of “base-level reality”. I know you said it’s what informs you about the “actual real world”, but this feels similarly confusing to me as defining base-level reality as “what really exists”. I know that reasoning and talking about things so abstract is hard and can easily lead to nonsense, but I’m still interested.
I’m curious about what even the purpose is of having an ontologically fundamental distinction between base-level reality and abstractions and whether it’s worth having. When asking, “Should I treat base-level reality and abstractions as fundamentally distinct?”, I think a good way to approximate this is by asking “Would I want an AI to reason as if its abstractions and base-level reality were fundamentally distinct?”
And I’m not completely sure they should. AIs, to reason practically, need to use “abstractions” in at least some of their models. If you want, you could have a special “This is just an abstraction” or “this is base-level reality” tag on each of your models, but I’m not sure what the benefit of this would be or what you would use it for.
Even without such a distinction, an AI would have both models that would be normally considered abstractions, as well as those of what you would think of as base-level reality, and would select which models to use based on their computational efficiency and the extent to which they are relevant and informative to the topic at hand. That sounds like a reasonable thing to do, and I’m not clear how ascribing fundamental difference to “abstractions” and “base-level” reality would do better than this.
If the AI talks with humans that use the phrase “base-level reality”, then it could potentially be useful for the AI to come up with an is-base-level-reality predicate in its world model in order to model things that answer, “When will this person call something base-level reality?” But such a predicate wouldn’t be treated as fundamentally different from any other predicate, like “Is a chair”.
When asking, “Should I treat base-level reality and abstractions as fundamentally distinct?”, I think a good way to approximate this is by asking “Would I want an AI to reason as if its abstractions and base-level reality were fundamentally distinct?”
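To make the idea above concrete, here is a minimal sketch of an agent that picks whichever model gives the best trade-off between accuracy and compute for the topic at hand, and that stores an is-base-level-reality predicate in the same plain table as an is-chair predicate. Everything in it is an assumption made for illustration: the `Model` class, the cost and accuracy numbers, the `cost_weight` parameter, and the toy predicate definitions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Model:
    name: str
    cost: float                 # hypothetical compute cost per prediction
    accuracy: Dict[str, float]  # hypothetical expected accuracy per topic


def choose_model(models: List[Model], topic: str, cost_weight: float = 0.1) -> Model:
    """Pick the model with the best accuracy-minus-cost score for a topic."""
    def score(m: Model) -> float:
        return m.accuracy.get(topic, 0.0) - cost_weight * m.cost
    return max(models, key=score)


# Two models of the same world; neither is tagged as "more real" than the other.
MODELS = [
    Model("quantum_physics", cost=5.0,
          accuracy={"particle_experiment": 0.99, "furniture": 0.99}),
    Model("everyday_objects", cost=0.1,
          accuracy={"particle_experiment": 0.10, "furniture": 0.95}),
]

# Learned predicates all live in one ordinary table. The second one is a toy
# stand-in for "would this person call it base-level reality?".
PREDICATES: Dict[str, Callable[[str], bool]] = {
    "is_chair": lambda thing: thing in {"chair", "stool"},
    "is_base_level_reality": lambda thing: thing in {"electron", "quark"},
}

print(choose_model(MODELS, "furniture").name)
print(choose_model(MODELS, "particle_experiment").name)
```

Nothing in the selection rule consults `is_base_level_reality`; it is only used when the agent needs to predict how humans will talk.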
Do you want an AI to be able to conceive of anything along the lines of “how correct is my model”, to distinguish hypothetical from actual, or illusion from substance?
If you do, then you want something that fits in the conceptual space pointed at by “base-level reality”, even if it doesn’t use that phrase or even have the capability to express it.
I suppose it might be possible to have a functioning AI that is capable of reasoning and forming models without being able to make any such distinctions, but I can’t see a way to do it that won’t be fundamentally crippled compared with human capability.
I’m interested in your thoughts on how the AI would be crippled.
I don’t think it would be crippled in terms of empirical predictive accuracy, at least. The AI could still come up with all the low-level models like quantum physics, as well as keep the abstract ones like “this is what a chair is”, and then just use whichever it needs to get the highest possible predictive accuracy in a given circumstance.
If the AI is built to make and run quantum physics experiments, then in order to have high predictive accuracy it would need to learn and use an accurate model of quantum physics. But I don’t see why you would need a distinction between base-level reality and abstractions to do that.
The AI could still learn a sense of “illusion”. If the AI is around psychotic people who have illusions a lot, then I don’t see what’s stopping the AI from forming a model saying, “Some people experience these things called ‘illusions’, and it makes them take the wrong action or make wrong predictions as specified in <insert model of how people react to illusions>”.
And I don’t see why the AI wouldn’t be able to consider the possibility that it also experiences illusions. For example, suppose the AI is in the desert and keeps seeing what looks like an oasis. But when the AI gets closer, it sees only sand. To have higher predictive accuracy in this situation, the AI could learn a (non-ontologically fundamental) “is-an-illusion” predicate.
Would the crippling be in terms of scoring highly on its utility function, rather than just predicting percepts? I don’t really see how this would be a problem. I mean, suppose you want an AI to make chairs. Then even if the AI lacked a notion of base-level reality, it could still learn accurate models of how chairs work and how they are manufactured. Then the AI could have its utility function defined in terms of its notion of chairs to make it make chairs.
Could you give any specific example in which an AI using no ontologically fundamental notion of base-level reality would either make the wrong prediction or make the wrong action, in a way that would be avoided by using such a notion?
This feels like a bait-and-switch since you’re now talking about this in terms of an “ontologically fundamental” qualifier where previously you were only talking about “ontologically different”.
To you, does the phrase “ontologically fundamental” mean exactly the same thing as “ontologically different”? It certainly doesn’t to me!
It was a mistake for me to conflate “ontologically fundamental” and “ontologically different”.
Still, I had in mind that they were ontologically different in some fundamental way. It was my mistake to merely use the word “different”. I had imagined that to make an AI that’s reasonable, it would actually make sense to hard-code some notion of base-level reality as well as abstractions, and to treat them differently. For example, you could have the AI have a single prior over “base-level reality”, then just come up with whatever abstractions work well for predictively approximating the base-level reality. Instead it seems like the AI could just learn the concept of “base-level reality” like it would learn any other concept. Is this correct?
Also, in the examples I gave, I think the AI wouldn’t actually have needed a notion of base-level reality. The concept of a mirage is different from the concept of non-base-level reality. So is the concept of a mental illusion. Understanding both of those is different than understanding the concept of base-level reality.
If humans use the phrase “base-level reality”, I still don’t think it would be strictly necessary for an AI to have the concept. The AI could just know rules of the form, “If you ask a human if x is base-level reality, they will say ‘yes’ in the following situations...”, and then describe the situations.
So it doesn’t seem to me like the actual concept of “base-level reality” is essential, though it might be helpful. Of course, I might be missing or misunderstanding something. Corrections are appreciated.
The concept of a mirage is different from the concept of non-base-level reality.
Different in a narrow sense, yes. “Refraction through heated air that can mislead a viewer into thinking it is reflection from water” is indeed different from “lifetime sensory perceptions that mislead about the true nature and behaviour of reality”. However, my opinion is that any intelligence that can conceive of the first without being able to conceive of the second is crippled by comparison with the range of human thought.
...lifetime sensory perceptions that mislead about the true nature and behaviour of reality
I don’t think you would actually need a concept of base-level reality to conceive of this.
First off, let me say that it seems pretty hard to come up with lifetime sensory percepts that would mislead about reality. Even if the AI was in a simulation, the physical implementation is part of reality. And the AI could learn about it. And from this, the AI could also potentially learn about the world outside the simulation. AIs commonly try to come up with the simplest (in terms of description length), most predictively accurate model of their percepts they can. And I bet the simplest models would involve having a world outside the simulation with specified physics, that would result in the simulations being built.
That said, lifetime sensory percepts can still mislead. For example, the simplest, highest-prior models that explain the AI’s percepts might say it’s in a simulation run by aliens. However, suppose the AI’s simulation actually just poofed into existence without a cause, and the rest of the world is filled with giant hats and no aliens. An AI, even without a distinction between base-level reality and abstractions, would still be able to come up with this model. If this isn’t a model involving percepts misleading you about the nature of reality, I’m not sure what is. So it seems to me that such AIs would be able to conceive of the idea of percepts misleading about reality. And the AIs would assign low probability to being in the all-hat world, just as they should.
Even if the AI was in a simulation, the physical implementation is part of reality. And the AI could learn about it.
The only means would be errors in the simulation.
Any underlying reality that supports Turing machines or any of the many equivalents can simulate every computable process. Even in the case of computers with bounded resources, there are corresponding theorems that show that the process being computed does not depend upon the underlying computing model.
So the only thing that can be discerned is that the underlying reality supports computation, which says essentially nothing about the form that it takes.
An AI, even without a distinction between base-level reality and abstractions, [...] would be able to conceive of the idea of percepts misleading about reality
How can it conceive of the idea of percepts misleading about reality if it literally can’t conceive of any distinction between models (which are a special case of abstractions) and reality?
Well, the only absolute guarantee the AI can make is that the underlying reality supports computation.
But it can still probabilistically infer other things about it. Specifically, the AI knows not only that the underlying reality supports computation, but also that there was some underlying process that actually created the simulation it’s in. Even though Conway’s Game of Life can allow for arbitrary computation, many possible configurations of the world state would result in no AI simulations being made. The configurations that would result in AI simulations being made would likely involve some sort of intelligent civilization creating the simulations. So the AI could potentially predict the existence of this civilization and infer some things about it.
Regardless, even if the AI can’t infer anything else about outside reality, I don’t see how this is a fault of not having a notion of base-level reality. I mean, if you’re correct, then it’s not clear to me how an AI with a notion of base-level reality would do inferentially better.
How can it conceive of the idea of percepts misleading about reality if it literally can’t conceive of any distinction between models (which are a special case of abstractions) and reality?
Well, as I said before, the AI could still consider the possibility that the world is composed entirely of hats (minus the AI simulation). The AI could also have a model of Bayesian inference and infer that the Bayesian probability that would be rational to assign to “the world is all hats” is low and its evidence makes it even lower. So, by combining these two models, the AI can come up with a model that says, “The world is all hats, even though everything I’ve seen, according to probability theory, makes it seem like this isn’t the case”. That sounds like a model about the idea of percepts misleading about reality.
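As a rough illustration of the reasoning above, here is a small sketch in which the AI holds the all-hats hypothesis as one model among others, gives it a low prior based on description length, and then lowers it further after conditioning on its percepts. The hypothesis names, the description lengths in bits, and the likelihood values are all made-up numbers for illustration, not estimates of anything.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def bayes_update(prior: Dict[str, float],
                 likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior is proportional to prior times likelihood."""
    return normalize({h: prior[h] * likelihood[h] for h in prior})


# Hypothetical description lengths (in bits) of two world models.
DESCRIPTION_BITS = {
    "built_by_civilization": 20,  # simulation created by some civilization
    "uncaused_in_hat_world": 30,  # simulation poofed into a world of hats
}

# Simplicity prior: each extra bit of description halves the prior weight.
prior = normalize({h: 2.0 ** -bits for h, bits in DESCRIPTION_BITS.items()})

# Hypothetical probability of the AI's actual percepts under each model.
LIKELIHOOD = {"built_by_civilization": 0.5, "uncaused_in_hat_world": 0.1}

posterior = bayes_update(prior, LIKELIHOOD)

# The AI can represent "the world is all hats, although my evidence
# points the other way" by holding the hypothesis next to its low posterior.
print(prior["uncaused_in_hat_world"], posterior["uncaused_in_hat_world"])
```

The point of the sketch is only that representing the hypothesis and assigning it a low probability requires no special tag marking one model as base-level reality.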
I know we’ve been going back and forth a lot, but I think these are pretty interesting things to talk about, so I thank you for the discussion.
It might help if you try to describe a specific situation in which the AI makes the wrong prediction or takes the wrong action for its goals. This could help me better understand what you’re thinking about.
Well, as I said before, the AI could still consider the possibility that the world is composed entirely of hats (minus the AI simulation).
At this point I’m not sure there’s much point in discussing further. You’re using words in ways that seem self-contradictory to me.
You said “the AI could still consider the possibility that the world is composed of [...]”. Considering a possibility is creating a model. Models can be constructed about all sorts of things: mathematical statements, future sensory inputs, hypothetical AIs in simulated worlds, and so on. In this case, the AI’s model is about “the world”, that is to say, reality.
So it is using a concept of model, and a concept of reality. It is only considering the model as a possibility, so it knows that not everything true in the model is automatically true in reality and vice versa. Therefore it is distinguishing between them. But you posited that it can’t do that.
To me, this is a blatant contradiction. My model of you is that you are unlikely to post blatant contradictions, so I am left with the likelihood that what you mean by your statements is wholly unlike the meaning I assign to the same statements. This does not bode well for effective communication.
Yeah, it might be best to wrap up the discussion. It seems we aren’t really understanding what the other means.
So it is using a concept of model, and a concept of reality. It is only considering the model as a possibility, so it knows that not everything true in the model is automatically true in reality and vice versa. Therefore it is distinguishing between them. But you posited that it can’t do that.
Well, I can’t say I’m really following you there. The AI would still have a notion of reality. It just would consider abstractions like chairs and tables to be part of reality.
There is one thing I want to say though. We’ve been discussing the question of whether a notion of base-level reality is necessary to avoid severe limitations in reasoning ability. And to see why I think it’s not, just consider regular humans. They often don’t have a distinction between base-level reality and abstractions. And yet, they can still reason about the possibility of life-long illusions as well as function well to accomplish their goals. And if you taught someone the concept of “base-level reality”, I’m not sure it would help them much.
It sounds like you’re using very different expectations for those questions, as opposed to the very rigorous interrogation of base reality. ‘Does Santa exist?’ and ‘does that chair exist?’ are questions which (implicitly, at least) are part of a system of questions like ‘what happens if I set trip mines in my chimney tonight?’ and ‘if I try to sit down, will I fall on my ass?’ which have consequences in terms of sensory input and feedback. You can respond ‘yes’ to the former, if you’re trying to preserve a child’s belief in Santa (although I contend that’s a lie) and you can truthfully answer ‘no’ to the latter if you want to talk about an investigation of base reality.
Of course, if you answer ‘no’ to ‘does that chair exist?’ your interlocutor will give you a contemptuous look, because that wasn’t the question they were asking, and you knew that, and you chose to answer a different question anyway.
I choose to think of this as different levels of resolution, or as varying bucket widths on a histogram. To the question ‘does Jupiter orbit the Sun?’ you can productively answer ‘yes’ if you’re giving an elementary school class a basic lesson on the structure of the solar system. But if you’re trying to slingshot a satellite around Ganymede, the answer is going to be no, because the Solar-Jovian barycenter lies outside the Sun’s surface, and at the level you’re operating, that’s actually relevant.
Most people don’t use the words ‘reality’ or ‘exist’ in the way we’re using them here, not because people are idiots, but because they don’t have a coherent existential base for non-idiocy, and because it’s hard to justify the importance of those questions when you spend your whole life in sensory reality.
As to the aliens, well, if they don’t distinguish between base level reality and abstractions, they can make plenty of good sensory predictions in day-to-day life, but they may run into some issues trying to make predictions in high-energy physics. If they manage to do both well, it sounds like they’re doing a good job operating across multiple levels of resolution. I confess I don’t have a strong grasp on the subject, or on the differences between a model being real versus not being real in terms of base reality. I’m gonna wait on JBlack’s response to that.
I generally agree with the content of the articles you linked, and that there are different notions of “really exist”. The issue is, I’m still not sure what “base-level reality” means. JBlack said it was what “really exists”, but since JBlack seems to be using a notion of “what really exists” that’s different from the one people normally use, I’m not really sure what it means.
In the end, you can choose to define “what really exists” or “base-level reality” however you want, but I’m still wondering about what people normally take them to mean.
I try to avoid using the word ‘really’ for this sort of reason. Gets you into all sorts of trouble.
(a) JBlack is using a definition related to simulation theory, and I don’t know enough about this to speculate too much, but it seems to rely on a hard discontinuity between base and sensory reality.
(b) Before I realized he was using it that way, I thought the phrase meant ‘reality as expressed on the most basic level yet conceivable’ which, if it is possible to understand it, explodes the abstractions of higher orders and possibly results in their dissolving into absurdity. This is a softer transition than the above.
(c) I figure most people use ‘really exist’ to refer to material sensory reality as opposed to ideas. This chair exists, the Platonic Idea of a chair does not. The rule with this sort of assumption is ‘if I can touch it, or it can touch me, it exists’ for a suitably broad understanding of ‘touch.’
(d) I’ve heard some people claim that the only things that ‘really exist’ are those you can prove with mathematics or deduction, and mere material reality is a frivolity.
(e) I know some religious people believe heavily in the primacy of God (or whichever concept you want to insert here) and regard the material world as illusory, and that the afterlife is the ‘true’ world. You can see this idea everywhere from the Kalachakra mandala to the last chapter of The Screwtape Letters.
I guess the one thing uniting all these is that, if it were possible to take a true Outside View, this is what you would see; a Platonic World of ideas, or a purely material universe, or a marble held in the palm of God, or a mass of vibrating strings (or whatever the cool kids in quantum physics are thinking these days) or a huge simulation of any of the above instantiated on any of the above.
I think most people think in terms of option c, because it fits really easily into a modern materialist worldview, but the prevalence of e shouldn’t be downplayed. I’ve probably missed some important ones.
I had made a post proposing a new alignment technique. I didn’t get any responses, but it still seems like a reasonable idea to me, so I’m interested in hearing what others think of it. I think the basic idea of the post, if correct, could be useful for future study. However, I don’t want to waste time doing this if the idea is unworkable for a reason I hadn’t thought of.
(If you’re interested, please read the post before reading below.)
Of course, the idea’s not a complete solution to alignment, and things have a risk of going catastrophically wrong due to other problems, like unreliable reasoning. But it still seems to me that it’s potentially helpful for outer alignment and corrigibility.
If the humans actually directly answer any query about the desirability of an outcome, then it’s hard for me to see a way this system wouldn’t be outer-aligned.
Now, consulting humans every time results in a very slow objective function. Most optimization algorithms I know of rely on huge numbers of queries to the objective function, so using these algorithms with humans manually implementing the objective function would be infeasible. However, I don’t see anything in principle impossible with coming up with an optimization algorithm that scores well on its objective function even if that function is extremely slow. Even if the technique I described in the post to do this was wrong, I haven’t seen anyone looking into this, so it doesn’t seem clearly unworkable to me.
Even if this does turn out to be intractable, I think the basic motivation of my post still has the potential to be useful. The main motivation of my post is to give the AI a hard-coded method of querying humans before making major strategic decisions and of updating its beliefs about what is desirable with their responses. But that is a technique that could be used in other AI systems as well. It wouldn’t solve everything, of course, but it could provide an additional level of safety. I’m not sure if this idea has been discussed before.
I also have yet to find anything seriously problematic about the method I provided to optimize with limited calls to the objective function. There could of course be some I haven’t thought of, though.
I found what seems to be a potentially dangerous false negative in the most popular definition of optimizer. I didn’t get a response, so I would appreciate feedback on whether it’s reasonable. I’ve been focusing on defining “optimizer”, so I think feedback would help me a lot. You can see my comment here.
I’ve realized I’m somewhat skeptical of the simulation argument.
The simulation argument proposed by Bostrom argued, roughly, that either almost exactly all Earth-like worlds don’t reach a posthuman level, almost exactly all such civilizations don’t go on to build many simulations, or that we’re almost certainly in a simulation.
Now, if we knew that the only two sorts of creatures that experience what we experience are either in simulations or the actual, original, non-simulated Earth, then I can see why the argument would be reasonable. However, I don’t know how we could know this.
For example, consider zoos: Perhaps advanced aliens create “zoos” featuring humans in an Earth-like world, for their own entertainment or other purposes. These wouldn’t necessarily be simulations of any actual other planet, but might merely have been inspired by actual planets. Similarly, lions in the zoo are similar to lions in the wild, and their enclosure features plants and other environmental features similar to what they would experience in the wild. But I wouldn’t call lions in zoos simulations of wild lions, even if the developed parts where humans could view them were completely invisible to them and their enclosure was arbitrarily large.
Similarly, consider games: Perhaps aliens create games or something like them set in Earth-like worlds that aren’t actually intended to be simulations of any particular world. Similarly, human fantasy RPGs often have a medieval theme, so maybe aliens would create games set in a modern-Earth-like world, without having in mind any actual planet to simulate.
Now, you could argue that in an infinite universe, these things are all actually simulations, because there must be some actual, non-simulated world that’s just like the “zoo” or game. However, by that reasoning, you could argue that a rock you pick up is nothing but a “rock simulation” because you know there is at least one other rock in the universe with the exact same configuration and environment as the rock you’re holding. That doesn’t seem right to me.
Similarly, you could say, then, that I’m actually in a simulation right now. Because even if I’m in the original Earth, there is some other Chantiel in the universe in a situation identical to my current one, who is logically constrained to do the same thing I do, so thus I am a simulation of her. And my environment is thus a simulation of hers.
Now, if we knew that the only two sorts of creatures that experience what we experience are either in simulations or the actual, original, non-simulated Earth, then I can see why the argument would be reasonable. However, I don’t know how we could know this.
For example, consider zoos: Perhaps advanced aliens create “zoos” featuring humans in an Earth-like world, for their own entertainment or other purposes.
This falls under either #1 or #2, since you don’t say what human capabilities are in the zoo or explain how exactly this zoo situation matters to running simulations; do we go extinct at some time long in the future when our zookeepers stop keeping us alive (and “go extinct before reaching a “posthuman” stage”), having never become powerful zookeeper-level civs ourselves, or are we not permitted to (“extremely unlikely to run a significant number of simulations”)?
Similarly, consider games: Perhaps aliens create games or something like them set in Earth-like worlds that aren’t actually intended to be simulations of any particular world.
This is just fork #3: “we are in a simulation”. At no point does fork #3 require it to be an exact true perfect-fidelity simulation of an actual past, and he is explicit that the minds in the simulation may be only tenuously related to ‘real’/historical minds; if aliens would be likely to create Earth-like worlds, for any reason, that’s fine because that’s what’s necessary, because we observe an Earth-like world (see the indifference principle section).
he is explicit that the minds in the simulation may be only tenuously related to ‘real’/historical minds;
Oh, I guess I missed this. Do you know where Bostrom said the “simulations” can be only tenuously related to real minds? I was rereading the paper but didn’t see mention of this. I’m just surprised, because normally I don’t think zoo-like things would be considered simulations.
This falls under either #1 or #2, since you don’t say what human capabilities are in the zoo or explain how exactly this zoo situation matters to running simulations; do we go extinct at some time long in the future when our zookeepers stop keeping us alive (and “go extinct before reaching a “posthuman” stage”), having never become powerful zookeeper-level civs ourselves, or are we not permitted to (“extremely unlikely to run a significant number of simulations”)?
In case I didn’t make it clear, I’m saying that even if a significant proportion of civilizations reach a post-human stage and a significant proportion of these run simulations, there would still potentially be a non-small chance of actually not being in a simulation and instead being in a game or zoo. For example, suppose each post-human civilization makes 100 proper simulations and 100 zoos. Then even if parts 1 and 2 of the simulation argument are true, you still have a 50% chance of ending up in a zoo.
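The 50% figure can be checked with a short sketch. It assumes, purely for illustration, that every world (simulation, zoo, or original) holds the same number of observers and that you are equally likely to be any of them; the function name `chance_of_each` is hypothetical and not from the comment.

```python
from fractions import Fraction


def chance_of_each(simulations: int, zoos: int, originals: int = 0) -> dict:
    """Chance of finding yourself in each kind of world.

    Assumption (not from the comment): every world holds the same number
    of observers, so the chance is just the share of worlds of that kind.
    """
    total = simulations + zoos + originals
    return {
        "simulation": Fraction(simulations, total),
        "zoo": Fraction(zoos, total),
        "original": Fraction(originals, total),
    }


# The comment's example: 100 proper simulations and 100 zoos per civilization.
even_split = chance_of_each(simulations=100, zoos=100)

# Same example, but also counting the one non-simulated world that built them.
with_original = chance_of_each(simulations=100, zoos=100, originals=1)

print(even_split["zoo"])
print(float(with_original["zoo"]))
```

Counting the original world as well moves the zoo share only slightly below one half, so the point of the example does not depend on that detail.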
“If the real Chantiel is so correlated with you that they will do what you will do, then you should believe you’re real so that the real Chantiel will believe they are real, too. This holds even if you aren’t real.”
By “real”, do you mean non-simulated? Are you saying that even if 99% of Chantiels in the universe are in simulations, then I should still believe I’m not in one? I don’t know how I could convince myself of being “real” if 99% of Chantiels aren’t.
Do you perhaps mean I should act as if I were non-simulated, rather than literally being non-simulated?
It doesn’t matter how many fake versions of you hold the wrong conclusion about their own ontological status, since those fake beliefs exist in fake versions of you. The moral harm caused by a single real Chantiel thinking they’re not real is infinitely greater than infinitely many non-real Chantiels thinking they are real.
Interesting. When you say “fake” versions of myself, do you mean simulations? If so, I’m having a hard time seeing how that could be true. Specifically, what’s wrong about me thinking I might not be “real”? I mean, if I thought I was in a simulation, I think I’d do pretty much the same things I would do if I thought I wasn’t in a simulation. So I’m not sure what the moral harm is.
Do you have any links to previous discussions about this?
I am also skeptical of the simulation argument, but for different reasons.
My main issue is: the normal simulation argument requires violating the Margolus–Levitin theorem[1], as it requires that you can do an arbitrary amount of computation[2] via recursively simulating[3].
This either means that the Margolus–Levitin theorem is false in our universe (which would be interesting), we’re a ‘leaf’ simulation where the Margolus–Levitin theorem holds, but there are many universes where it does not (which would also be interesting), or we have a non-zero chance of not being in a simulation.
This is essentially a justification for ‘almost exactly all such civilizations don’t go on to build many simulations’.
Call the scaling factor (of the amount of computation necessary to simulate X amount of computation) C. So, e.g., C=0.5 means that to simulate 1 unit of computation you need 2 units of computation. If C≥1, then you can violate the Margolus–Levitin theorem simply by recursively sub-simulating far enough. If C<1, then a universe that can do X computation can simulate no more than CX total computation regardless of how deep the tree is, in which case there’s at least a 1−C chance that we’re in the ‘real’ universe.
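One way to see where a bound like 1−C could come from is the sketch below. It uses a deliberately pessimistic accounting that is an assumption of this example, not something stated in the comment: every level spends all of its computation on a single child simulation, and each nested level is counted separately. The name `unsimulated_share` is hypothetical.

```python
def unsimulated_share(c: float, depth: int) -> float:
    """Share of all computation that happens in the top-level universe.

    Assumptions (illustrative only): the top level performs 1 unit of
    computation, each level devotes everything to one child simulation,
    and level k therefore performs c**k units.
    """
    if not 0 <= c < 1:
        raise ValueError("this sketch only covers 0 <= C < 1")
    top_level = 1.0
    # Geometric series: c + c**2 + ... + c**depth, bounded by c / (1 - c).
    simulated = sum(top_level * c**level for level in range(1, depth + 1))
    return top_level / (top_level + simulated)


for depth in (1, 5, 50):
    print(depth, unsimulated_share(0.5, depth))
```

Under these assumptions the share falls as the tree deepens but never drops below 1−C, because the geometric series converges. A different accounting, for example levels that simulate many children or observers that are not proportional to computation, would give a different number.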
My main issue is: the normal simulation argument requires violating the Margolus–Levitin theorem[1], as it requires that you can do an arbitrary amount of computation[2] via recursively simulating[3].
No, it doesn’t, any more than “Godel’s theorem” or “Turing’s proof” proves simulations are impossible or “problems are NP-hard and so AGI is impossible”.
If C≥1, then you can violate the Margolus–Levitin theorem simply by recursively sub-simulating far enough. If C<1, then a universe that can do X computation can simulate no more than CX total computation regardless of how deep the tree is, in which case there’s at least a 1−C chance that we’re in the ‘real’ universe.
There are countless ways to evade this impossibility argument, several of which are already discussed in Bostrom’s paper (I think you should reread the paper), e.g. simulators can simply approximate, simulate smaller sections, tamper with observers inside the simulation, slow down the simulation, cache results like HashLife, and so on. (How do we simulate anything already...?)
All your Margolus-Levitin handwaving can do is disprove a strawman simulation along the lines of a maximally dumb pessimal 1:1 exact simulation of everything with identical numbers of observers at every level.
No, it doesn’t, any more than “Godel’s theorem” or “Turing’s proof” proves simulations are impossible or “problems are NP-hard and so AGI is impossible”.
I don’t follow your logic here, which probably means I’m missing something. I agree that your latter cases are invalid logic. I don’t see why that’s relevant.
simulators can simply approximate
This does not evade this argument. If nested simulations successively approximate, total computation decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
simulate smaller sections
This does not evade this argument. If nested simulations successively simulate smaller sections, total computation decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
tamper with observers inside the simulation
This does not evade this argument. If nested simulations successively tamper with observers, this does not affect total computation—total computation still decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
slow down the simulation
This does not evade this argument. If nested simulations successively slow down, total computation[1] decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
cache results like HashLife
This does not evade this argument. Using HashLife, total computation still decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
(How do we simulate anything already...?)
By accepting a multiplicative slowdown per level of simulation in the infinite limit[2], and not infinitely nesting.
See note 2 in the parent: “Note: I’m using ‘amount of computation’ as shorthand for ‘operations / second / Joule’. This is a little bit different than normal, but meh.”
You absolutely can, in certain cases, get no slowdown or even a speedup by doing a finite number of levels of simulation. However, this does not work in the limit.
This does not evade this argument. If nested simulations successively approximate, total computation decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
No, it evades the argument by showing that what you take as a refutation of simulations is entirely compatible with simulations. Many impossibility proofs prove an X where people want it to prove a Y, and the X merely superficially resembles a Y.
This does not evade this argument. If nested simulations successively simulate smaller sections, total computation decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
No, it evades the argument by showing that what you take as a refutation of simulations is entirely compatible with simulations. Many impossibility proofs prove an X where people want it to prove a Y, and the X merely superficially resembles a Y.
This does not evade this argument. If nested simulations successively tamper with observers, this does not affect total computation—total computation still decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
No, it...
This does not evade this argument. If nested simulations successively slow down, total computation[1] decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
No, it...
This does not evade this argument. Using HashLife, total computation still decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
No, it...
Reminder: you claimed:
My main issue is: the normal simulation argument requires violating the Margolus–Levitin theorem[1], as it requires that you can do an arbitrary amount of computation[2] via recursively simulating[3].
The simulation argument does not require violating the M-L theorem to the extent it is superficially relevant and resembles an impossibility proof of simulations.
Are you saying that we can’t be in a simulation because our descendants might go on to build a large number of simulations themselves, requiring too many resources in the base reality? But I don’t think that weakens the argument very much, because we aren’t currently in a position to run a large number of simulations. Whoever is simulating us can just turn off/reset the simulation before that happens.
Said argument applies if we cannot recursively self-simulate, regardless of reason (Margolus–Levitin theorem, parent turning the simulation off or resetting it before we could, etc).
In order for ‘almost all’ computation to be simulated, most simulations have to be recursively self-simulating. So either we can recursively self-simulate (which would be interesting), we’re rare (which would also be interesting), or we have a non-zero chance we’re in the ‘real’ universe.
The argument is not that generic computations are likely simulated, it’s about our specific situation—being a newly intelligent species arising in an empty universe. So simulationists would take the ‘rare’ branch of your trilemma.
If you’re stating that generic intelligence was not likely simulated, but generic intelligence in our situation was likely simulated...
Doesn’t that fall afoul of the mediocrity principle applied to generic intelligence overall?
(As an aside, this does somewhat conflate ‘intelligence’ and ‘computation’; I am assuming that intelligence requires at least some non-zero amount of computation. It’s good to make this assumption explicit I suppose.)
Doesn’t that fall afoul of the mediocrity principle applied to generic intelligence overall?
Sure. I just think we have enough evidence to overrule the principle, in the form of sensory experiences apparently belonging to a member of a newly-arisen intelligent species. Overruling mediocrity principles with evidence is common.
I had recently posted a question asking whether iterated amplification was actually more powerful than mere mimicry and arguing that it was not. I had thought I was making a pretty significant point, but the post attracted very little attention. I’m not saying this is a bad thing, but I’m not really sure why it happened, so I would appreciate some insight about how I can contribute more usefully.
Iterated amplification seems to be the leading proposal for creating aligned AI, so I thought a post arguing against it, if correct, would be a useful contribution. Perhaps there is some mistake in my reasoning, but I have yet to see any mentioned. It’s possible that people have already thought of this consideration and posted about it, but I have yet to find any such posts, so I’m not really sure.
Would it have been better posting it as an actual post instead of framing it as a question? I have some more to say to argue for mimicry than I mentioned in the question; would it be worthwhile for me to add it and then post this as a non-question post?
It’s true that most problems could be delegated to uploads, and any specific design is a design that the uploads could come up with just as well or better. The issue is that we don’t have uploads, and most plans to get them before AGI involve the kind of hypothetical AI know-how that might easily be used to build an agentic AGI, the risk the uploads are supposed to resolve.
Thus the “humans” of a realistic implementation of HCH are expected to be vague imitations of humans that only function somewhat sensibly in familiar situations and for a short time, not fully functional uploads, and most of the point of the specific designs is to mitigate the imperfection of their initial form, to make something safe/useful out of this plausibly feasible ingredient. One of the contentious points about this is whether it’s actually possible to build something useful (let alone safe) out of such imperfect imitations, even if we build a large system out of them that uses an implausible amount of resources. This is what happens with an HCH that can use an infinite number of actual uploads (exact imitations) that are still restricted to an hour or a day of thinking/learning (and then essentially get erased, that is, can’t make further use of the things they learned). Designing something safe/useful in the exact imitation HCH setting is an easier problem than doing so in a realistic setting, so it’s a good starting point.
Thanks for the response. To be clear, when discussing mimics, I did not have in mind perfect uploads of people. Instead, they could indeed be rather limited imitations. For example, an AI designing improvements to itself doesn’t need to actually have a generally faithful imitation of human behavior. Instead, it could just know a few things, like, “make this algorithm score better on this thing without taking over the world”.
Still, I can see how, when it comes to especially limited imitations, iterated amplification could be valuable. This seems especially true if the imitations are unreliable in even narrow situations. It would be problematic if an AI tasked with designing powerful AI didn’t get the “act corrigibly, and don’t take over the world” part reliably right.
I’ve been thinking about what you’ve said about iterated amplification, and there are some things I’m unsure of. I’m still rather skeptical of the benefit of iterated amplification, so I’d really appreciate a response.
You mentioned that iterated amplification can be useful when you have only very limited, domain-specific models of human behavior, where such models would be unable to come up with the ability to create code. However, there are two things I’m wondering about. The first is that it seems to me that, for a wide range of situations, you need a general and robustly accurate model of human behavior to perform well. The second is that, even if you don’t have a general model of human behavior, it seems to me that it’s sufficient to only have one amplification step, which I suppose isn’t iterated amplification. And the big benefit to avoiding iterated amplification is that iterated amplification results in exponential decreases in reliability from compounding errors on each distillation step, but with a single amplification step, this exponential decrease in reliability wouldn’t occur.
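The compounding-error worry can be made concrete with a small sketch. It assumes, as a simplification that is not part of the comment, that each distillation step independently preserves correct behavior with the same probability; `reliability_after` is a hypothetical name.

```python
def reliability_after(steps: int, per_step: float) -> float:
    """Probability that behavior is still correct after `steps` distillations.

    Assumption (illustrative only): each step independently preserves
    correctness with probability `per_step`, so the probabilities multiply.
    """
    if steps < 0 or not 0.0 <= per_step <= 1.0:
        raise ValueError("need steps >= 0 and 0 <= per_step <= 1")
    return per_step ** steps


# One amplification step versus many iterated steps at 99% per-step reliability.
for steps in (1, 10, 100):
    print(steps, round(reliability_after(steps, 0.99), 3))
```

With a single step the reliability stays at the per-step value, while repeated steps shrink it geometrically. If errors were correlated or partially corrected between steps, the decay would look different.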
For the first topic, suppose your AI is trained to make movies. I think just about every human value is relevant to the creation of movies, because humans usually like movies with a happy ending, and to make an ending happy you need to understand what humans consider a “happy ending”.
Further, you would need an accurate model of human cognitive capabilities. To make a good movie, it needs to be easy enough for humans to understand. But sometimes it also shouldn’t be too easy, because that can remove the mystery of it.
And the above is not just true for movies: I think creating other forms of entertainment would involve the same things as above.
Could you do the above with only some domain-limited model of what counts as confusing or a good or bad ending in the context of movies? It’s not clear to me that this is possible. Movies involve a very wide variety of situations, and you need to keep things understandable and resulting in a happy ending in all of those circumstances. I don’t see how you could robustly do the above without a general model of what people find confusing or otherwise bad.
Further, whenever an AI needs to explain something to humans, it seems to me that it’s important that it has an accurate model of what humans can understand and not understand. Is there any way to do this with purely domain-specific models rather than with a general understanding of what people find confusing? It’s not clear to me that this is possible. For example, imagine an AI that needs to explain many different things. Maybe it’s tasked with creating learning materials or making the news. With such a broad category of things the AI needs to explain, it’s really not clear to me how an AI could do this without a general model of what makes things confusing or not.
Also more generally, it seems to me that whenever the AI is involved with human interaction in novel circumstances, it will need an accurate model of what people like and dislike. For example, consider an AI tasked with coming up with a plan for human workers. Doing so has the potential to involve an extremely wide range of values. For example, humans generally value novelty, autonomy, not feeling embarrassed, not being bored, not being overly pressured, not feeling offended, and not seeing disgusting or ugly things.
Could you have an AI learn to avoid these things with only domain-specific models, rather than a general understanding of what people value and disvalue? I’m not sure how to do this. Maybe you could learn models that work for reflecting people’s values in limited circumstances. However, I think an essential component of intelligence is to come up with novel plans involving novel situations. And I don’t see how an agent could do this without a general understanding of values. For example, the AI might create entire new industries, and it would be important that any human workers in those industries would have satisfactory conditions.
Now, for the second topic: using amplification without iteration.
First off, I want to note that, even without a general model of humans, it’s still not really clear to me that you need any amplification at all. As I’ve said before, even mere human imitation has the potential to result in extremely high intelligence simply by doing the same things humans do, but much faster. As I mentioned previously, consider the human output to be published research papers from top researchers, and the AI is tasked with mimicking it. Then the AI could take the research papers as the human output and use this to create future papers, but far, far faster.
But suppose you do still need amplification. Then I don’t see why one amplification step wouldn’t be enough. I think that if you put together a sufficiently large number of intelligent humans and give them unlimited time to think, they’d be able to solve pretty much anything that iterated amplification with HCH would be able to solve. So, instead of having multiple amplification and distillation steps, you could instead just have one very large amplification step that would involve a large enough number of human models interacting that it could solve pretty much anything.
If the amplification step involves a sufficiently large number of people, you might be concerned that it would be intractable to emulate them all.
I’m not sure if this would be a problem. Consider again the AI designed to mimic the research papers of top researchers. I think that often a small number of top researchers are responsible for a large proportion of research progress, so the AI could potentially just see what the output of the top, say, 100 or 1000 researchers working together would be. And the AI would potentially be able to produce the outputs of each researcher with far less computation. That sounds plausibly like enough to me.
But suppose that’s not enough, and emulating every human individually during the amplification step is intractable. Then here’s how I think you can get around this: train not only a human model, but also a system for approximating the output of an expensive computation at much lower computational cost. Then, for the amplification step, you can define a computation involving an extremely large number of interacting emulated humans, and then allow the approximation system to come up with approximations to this without needing to directly emulate every human.
To give a sense of how this might work, note that in a computation, often a small number of the parts of the computation account for a large part of the output. For example, if you are trying to approximate a computation about gravity, commonly only the closest, most massive objects have a significant gravitational effect on something, and you can ignore the rest. Similarly, rather than simulate individual atoms, it’s much more efficient to come up with groups of large numbers of atoms, and consider their effect as a group. The same is true for other computations involving many small components.
To emulate humans, you could potentially do the same things as you would when simulating gravity. Specifically, an AI may be able to consider groups of humans and infer what the final output of that group will be, without actually needing to emulate each one individually. Further, for very challenging topics, many people may fail to contribute anything to the final result, so the AI could potentially avoid emulating them at all.
So I still can’t really see the benefit of iterated amplification. Of course, I could be missing something, so I’m interested in hearing what you think.
One potential problem is that it might be hard to come up with good training data for an arbitrary-function approximator, since finding the exact output of expensive functions would be expensive. However, it’s not clear to me how big of a problem this would be. As I’ve said before, even the output of 100 or 1000 humans interacting could potentially be all the AI ever needs, and with sufficiently fast approximations of individual humans, this could be tractable to create training data for.
Further, I bet the AI could learn a lot about arbitrary-function approximation just by training on approximating functions that are already reasonably fast to compute. I think the basic techniques for quickly approximating functions are what I mentioned before: come up with abstract objects that involve groups of individual components, and know when to stop performing the computation on a certain object because it’s clear it will have little effect on the final result.
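The grouping idea in the gravity example can be sketched roughly as follows. This is a toy one-dimensional setup with made-up masses and positions, not a real physics engine, and the function names `exact_pull` and `grouped_pull` are hypothetical. A distant cluster of bodies is replaced by one point mass at its center of mass, and the result is compared with the body-by-body sum.

```python
G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2


def exact_pull(bodies, at=0.0):
    """Signed acceleration at `at` from every (mass, position) pair on a line.

    Assumes no body sits exactly at `at`.
    """
    total = 0.0
    for mass, position in bodies:
        offset = position - at
        direction = 1.0 if offset > 0 else -1.0
        total += direction * G * mass / (offset * offset)
    return total


def grouped_pull(near_bodies, far_cluster, at=0.0):
    """Approximate pull: treat the far cluster as one point mass.

    The cluster is replaced by its total mass at its center of mass, so
    its members are never evaluated one by one.
    """
    cluster_mass = sum(mass for mass, _ in far_cluster)
    center = sum(mass * position for mass, position in far_cluster) / cluster_mass
    return exact_pull(list(near_bodies) + [(cluster_mass, center)], at)


# Illustrative data: one massive nearby body and 1000 small distant ones.
near = [(5.0e24, 1.0e7)]
far = [(1.0e20, 1.0e12 + i * 1.0e6) for i in range(1000)]

exact = exact_pull(near + far)
approx = grouped_pull(near, far)
relative_error = abs(exact - approx) / abs(exact)
print(relative_error)
```

In this setup the approximation evaluates two bodies instead of 1001 and the relative error is negligible, because the cluster is both far away and small compared with its distance. The approximation degrades as the cluster gets closer or more spread out.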
Amplification induces a dynamic in the model space, it’s a concept of improving models (or equivalently in this context, distributions). This can be useful when you don’t have good datasets, in various ways.
For robustness, you have a dataset that’s drawn from the wrong distribution, and you need to act in a way that you would’ve acted if it was drawn from the correct distribution. If you have an amplification dynamic that moves models towards few attractors, then changing the starting point (training distribution compared to target distribution) probably won’t matter. At that point the issue is for the attractor to be useful with respect to all those starting distributions/models. This doesn’t automatically make sense, comparing models by usefulness doesn’t fall out of the other concepts.
For chess, you’d use the idea of winning games (better models are those that win more, thus amplification should move models towards winning), which is not inherent in any dataset of moves. For AGI, this is much more nebulous, but things like reflection (thinking about a problem longer, conferring with others, etc.) seem like a possible way of bootstrapping a relevant amplification, if goodharting is kept in check throughout the process.
For robustness, you have a dataset that’s drawn from the wrong distribution, and you need to act in a way that you would’ve acted if it was drawn from the correct distribution. If you have an amplification dynamic that moves models towards few attractors, then changing the starting point (training distribution compared to target distribution) probably won’t matter. At that point the issue is for the attractor to be useful with respect to all those starting distributions/models. This doesn’t automatically make sense, comparing models by usefulness doesn’t fall out of the other concepts.
Interesting. Do you have any links discussing this? I read Paul Christiano’s post on reliability amplification, but couldn’t find mention of this. And, alas, I’m having trouble finding other relevant articles online.
Amplification induces a dynamic in the model space, it’s a concept of improving models (or equivalently in this context, distributions). This can be useful when you don’t have good datasets, in various ways. Also it ignores independence when talking about recomputing things
Yes, that’s true. I’m not claiming that iterated amplification doesn’t have advantages. What I’m wondering is if non-iterated amplification is a viable alternative. I haven’t seen non-iterated amplification proposed before for creating aligned AI. Amplification without iteration has the disadvantage that it may not have the attractor dynamic iterated amplification has, but it also doesn’t have the exponentially increasing unreliability iterated amplification has. So, to me at least, it’s not clear whether pursuing iterated amplification is a more promising strategy than amplification without iteration.
For me, the interesting thing about IDA is not capability amplification like self-play, but an attitude towards generation of datasets as a point of intervention into the workings of an AI for all kinds of improvements. So we have some AI that we want to make better in some respect, and the IDA methodology says that to do that, we should employ the AI to generate a dataset for retraining a new version of it that’s better than the original dataset in that respect. Then we retrain the AI using the new dataset. So amplification unpackages the AI into the form of an appropriately influenced dataset, and then learning repackages it for further use.
If the impact measure was poorly implemented, then I think such an impact-reducing AI could indeed result in the world turning out that way. However, note that the technique in the paper is intended to, for a very wide range of variables, make the world if the AI wasn’t turned on as similar as possible to what it would be like if it was turned on. So, you can potentially avoid the AI-controlled-drone scenario by including the variable “number of AI-controlled drones in the world” or something correlated with it, as these variables could have quite different values between a possible world in which the AI was turned on and a possible world in which the AI wasn’t.
Coming up with a set of variables wide enough to include that might seem a little difficult, but I’m not sure it would be. One option is to, for every definable function of the world, include the value of the function as one of the variables the AI considers and tries to avoid interfering with.
I have some concerns about an impact measure proposed here. I’m interested in working on impact measures, and these seem like very serious concerns to me, so it would be helpful seeing what others think about them. I asked Stuart, one of the authors, about these concerns, but he said he was too busy to work on dealing with them.
First, I’ll give a basic description of the impact measure. Have your AI be turned on from some sort of stochastic process that may or may not result in the AI being turned on. For example, consider sending a photon through a semi-silvered mirror. Let X be the statement, “The AI was turned on”. For a wide range of variables V, try to make the probability distribution P(Vi|X) as similar as possible to the probability distribution P(Vi|¬X).
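As a rough illustration of the comparison described above, here is a minimal sketch. It assumes each variable Vi takes finitely many values and that we somehow have samples of it from worlds where X holds and worlds where ¬X holds. The use of total variation distance, and the names `empirical_distribution`, `total_variation`, and `impact_penalty`, are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

def empirical_distribution(samples):
    """Estimate a discrete distribution from a list of observed values."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    0 means identical distributions, 1 means disjoint support.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

def impact_penalty(samples_given_x, samples_given_not_x):
    """Sum over variables V_i of the distance between P(V_i | X) and P(V_i | not X).

    Both arguments map a variable name to a list of sampled values.
    """
    penalty = 0.0
    for name in samples_given_x:
        p = empirical_distribution(samples_given_x[name])
        q = empirical_distribution(samples_given_not_x[name])
        penalty += total_variation(p, q)
    return penalty

# Hypothetical toy data: "drones" differs between the two cases, "weather" does not.
given_x = {"drones": [0, 0, 5, 5], "weather": ["rain", "sun", "sun", "sun"]}
given_not_x = {"drones": [0, 0, 0, 0], "weather": ["rain", "sun", "sun", "sun"]}
print(impact_penalty(given_x, given_not_x))
```

An AI minimizing a penalty of this kind would be pushed away from actions that shift the "drones" variable, which is the behavior the measure is meant to produce.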
Here are my concerns:
It’s not clear to me that we can find any verifiably-stochastic process. Some interpretations of quantum physics say that there are some latent variables that determine the outcome of seemingly stochastic processes. In this case, just knowing the AI was or was not turned on may thus have implications for latent variables that possibly have large effects on the world.
Further, I can’t find a way to define X and not-X that works, even if you find a truly stochastic process. You could try to specify X as, “the photon makes it through the semi-silvered mirror” and specify not-X as the negation of this. But conditioning on this and nothing else is insufficient. To see why, note that X and not-X don’t specify anything about the mechanics of how the photon and mirror interact. For example, X, above, doesn’t rule out universes in which there is no quantum physics as we know it, and the mirror and photon interact in some completely different way that makes the photon more likely to pass through than it would in our universe. Thus, conditioning on X would make it more likely the AI would be in that universe, which could have large effects on the values of variables.
You could try to get around this by also including in X and not-X mechanics describing how the photon and semi-silvered mirror interact. The issue is, we don’t actually know exactly how the photon and semi-silvered mirror interact. We have models of quantum physics that have extremely high predictive accuracy in practice, but we still don’t have a theory of everything. So we could include in X and not-X our current understanding of quantum physics, but that understanding might be somewhat wrong.
This could potentially be dangerous. Suppose the AI has invented some technology that, according to the actual, correct theory of everything, would destroy the world. But according to our slightly wrong model of quantum physics, it would do nothing in particular. Then the AI, having X and not-X condition on our current model of quantum physics, would see no problem with creating the technology.
Further still, I have the concern about what to do if the AI in fact doesn’t turn on. You could decide to just turn on some AI identical to it, and keep trying to turn on AIs until you get one to turn on. But then the world would pretty much be identical whether X or not-X turns out to be true, because the other AI would do the same thing it would have.
You could instead turn on some AI that uses the same reduced-impact technique, but is different from the current AI in some way that significantly affects the world. However, if you do this, then I think we might arrive in a very bad equilibrium in the AIs’ behavior. Specifically, suppose all AIs decide to destroy the world if they’re turned on. Then that means that if one of the AIs isn’t turned on, some other AI will be turned on, which would also lead to the world being destroyed. If all the destroyed worlds look pretty much the same, then the AI deciding to destroy the world would have low impact according to your definition.
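The equilibrium worry can be made concrete with a toy sketch. Everything here is hypothetical: worlds are reduced to a single coarse label, the names `world_outcome` and `measured_impact` are made up for illustration, and "impact" is just whether the two outcomes differ.

```python
def world_outcome(policy):
    """Map an AI's policy to a coarse description of the resulting world."""
    return "destroyed" if policy == "destroy" else "intact"

def measured_impact(outcome_if_on, outcome_if_off):
    """Toy impact measure: 0 if the world looks the same either way, 1 otherwise."""
    return 0 if outcome_if_on == outcome_if_off else 1

# If this AI is not turned on, a different reduced-impact AI (the backup) is
# turned on instead, so the "off" world is whatever the backup's policy produces.
for this_policy, backup_policy in [("destroy", "destroy"),
                                   ("destroy", "benign"),
                                   ("benign", "benign")]:
    impact = measured_impact(world_outcome(this_policy), world_outcome(backup_policy))
    print(this_policy, backup_policy, impact)
```

In this toy model, "destroy" scores zero impact whenever the backup AI also destroys the world, so the all-destroy profile is self-consistent under the measure.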
So suppose instead that if the AI doesn’t turn on, we rely on some other AI safety technique. Suppose this AI safety technique successfully makes safe AI. In this case, we had no need to create the reduced-impact AI you described, because the other AI design would have worked fine. Suppose the alternative AI safety technique in fact doesn’t work, and the result of using it is that our world gets destroyed. In that case, the conditioning on not-X would imply the world would be destroyed. So if an AI tries to make the world conditioning on X similar to it conditioning on not-X, then it would again act to destroy the world.
Luke responded to one of my points, saying, “Yes. [what to do if the AI doesn’t get turned on] needs to be handled with care. A time-delay might be sufficient for this, though.” My response was:
My concern with time-delays is that the AI might make some sort of time-bomb-esque device. For example, suppose the AI knows that if it wasn’t turned on, another AI like it would have been turned on three months later. Then both AIs could decide to make a “time bomb” set to go off four months after the first AI was scheduled to potentially turn on. And when it goes off, the “time bomb” might destroy the world, convert it into a mass of AI-controlled nanotechnology, or do some other bad thing. This way, neither AI would actually change the world relative to if it was never turned on.
I’ve come up with a system of infinite ethics intended to provide more reasonable moral recommendations than previously-proposed ones. I’m very interested in what people think of this, so comments are appreciated. I’ve made a write-up of it below.
One unsolved problem in ethics is that aggregate consequentialist ethical theories tend to break down if the universe is infinite. An infinite universe could contain both an infinite amount of good and an infinite amount of bad. If so, you are unable to change the total amount of good or bad in the universe, which can cause aggregate consequentialist ethical systems to break.
There has been a variety of methods considered to deal with this. However, to the best of my knowledge all proposals either have severe negative side-effects or are intuitively undesirable for other reasons.
Here I propose a system of aggregate consequentialist ethics intended to provide reasonable moral recommendations even in an infinite universe.
It is intended to satisfy the desiderata for infinite ethical systems specified in Nick Bostrom’s paper, “Infinite Ethics”. These are:
Resolving infinitarian paralysis. It must not be the case that all humanly possible acts come out as ethically equivalent.
Avoiding the fanaticism problem. Remedies that assign lexical priority to infinite goods may have strongly counterintuitive consequences.
Preserving the spirit of aggregative consequentialism. If we give up too many of the intuitions that originally motivated the theory, we in effect abandon ship.
Avoiding distortions. Some remedies introduce subtle distortions into moral deliberation.
I have yet to find a way in which my system fails any of the above desiderata. Of course, I could have missed something, so feedback is appreciated.
My ethical system
First, I will explain my system.
My ethical theory is, roughly, “Make the universe one agents would wish they were born into”.
By this, I mean, suppose you had no idea which agent in the universe you would be, what circumstances you would be in, or what your values would be, but you still knew you would be born into this universe. Consider having a bounded quantitative measure of your general satisfaction with life, for example, a utility function. Then try to make the universe such that the expected value of your life satisfaction is as high as possible if you conditioned on you being an agent in this universe, but didn’t condition on anything else. (Also, “universe” above means “multiverse” if there is one.)
In the above description I didn’t provide any requirement for the agent to be sentient or conscious. If you wish, you can modify the system to give higher priority to the satisfaction of agents that are sentient or conscious, or you can ignore the welfare of non-sentient or non-conscious agents entirely.
It’s not entirely clear how to assign a prior over situations in the universe you could be born into. Still, I think it’s reasonably intuitive that there would be some high-entropy situations among the different situations in the universe. This is all I assume for my ethical system.
Now I’ll give some explanation of what this system recommends.
Suppose you are considering doing something that would help some creature on Earth. Describe that creature and its circumstances, for example, as “<some description of a creature> in an Earth-like world with someone who is <insert complete description of yourself>”. And suppose doing so didn’t cause any harm to other creatures. Well, there is non-zero prior probability of an agent, having no idea what circumstances it will be in the universe, ending up in circumstances satisfying that description. By choosing to help that creature, you would thus increase the expected satisfaction of any creature in circumstances that match the above description. Thus, you would increase the overall expected value of the life-satisfaction of an agent knowing nothing about where it will be in the universe. This seems reasonable.
With similar reasoning, you can show why it would be beneficial to also try to steer the future state of our accessible universe in a positive direction. An agent would have nonzero probability of ending up in situations of the form, “<some description of a creature> that lives in a future colony originating from people from an Earth-like world that features someone who <insert description of yourself>”. Helping them would thus increase an agent’s prior expected life-satisfaction, just like above. This same reasoning can also be used to justify doing acausal trades to help creatures in parts of the universe not causally accessible.
The system also values helping as many agents as possible. If you only help a few agents, the prior probability of an agent ending up in situations just like those agents would be low. But if you help a much broader class of agents, the effect on the prior expected life satisfaction would be larger.
These all seem like reasonable moral recommendations.
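To make the recommendations above a little more concrete, here is a toy sketch. It assumes, purely for illustration, a finite list of circumstance descriptions with a made-up prior and made-up satisfaction values in [0, 1]. The system itself doesn't assume any such finite list, and the name `expected_satisfaction` is hypothetical.

```python
def expected_satisfaction(prior, satisfaction):
    """Expected life satisfaction of an agent that knows only that it is in this universe.

    prior: maps each circumstance description to its prior probability (sums to 1).
    satisfaction: maps each circumstance description to a value in [0, 1].
    """
    return sum(prior[c] * satisfaction[c] for c in prior)

# Hypothetical circumstances and numbers, chosen only for illustration.
prior = {"creature A near you": 0.01, "creatures like B": 0.10, "everyone else": 0.89}
baseline = {"creature A near you": 0.2, "creatures like B": 0.2, "everyone else": 0.5}

# Helping one narrow class of agents versus helping a broader class as well.
help_one = {**baseline, "creature A near you": 0.6}
help_many = {**baseline, "creature A near you": 0.6, "creatures like B": 0.6}

for label, world in [("baseline", baseline), ("help one", help_one), ("help many", help_many)]:
    print(label, round(expected_satisfaction(prior, world), 4))
```

With these made-up numbers, helping the narrow class raises the expected value a little, and helping the broader class raises it more, matching the reasoning above.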
I will now discuss how my system does on the desiderata.
Infinitarian paralysis
Some infinite ethical systems result in what is called “infinitarian paralysis”. This is the state of an ethical system being indifferent in its recommendations in worlds that already have infinitely large amounts of both good and bad. If there’s already an infinite amount of both good and bad, then our actions, using regular cardinal arithmetic, are unable to change the amount of good and bad in the universe.
My system does not have this problem. To see why, remember that my system says to maximize the expected value of your life satisfaction given you are in this universe but not conditioning on anything else. And the measure of life-satisfaction was stated to be bounded, say to be in the range [0, 1]. Since any agent can only have life satisfaction in [0, 1], then in an infinite universe, the expected value of life satisfaction of the agent must still be in [0, 1]. So, as long as a finite universe doesn’t have expected value of life satisfaction to be 0, then an infinite universe can at most only have finitely more moral value than it.
To say it another way, my ethical system provides a function mapping from possible worlds to their moral value. And this mapping always produces outputs in the range [0, 1]. So, trivially, you can see that no universe can have infinitely more moral value than another universe with non-zero moral value. ∞ just isn’t in the range of my moral value function.
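The boundedness argument can be written out as a short derivation. The notation is only a sketch and is not from the write-up: c ranges over the circumstances an agent could be in, P_w is the prior over circumstances in world w, s(c) is the life satisfaction in c, and V(w) is the moral value of w. I write a sum, which assumes countably many circumstances.

```latex
0 \le s(c) \le 1 \ \text{ for all } c
\quad\Longrightarrow\quad
0 \;\le\; V(w) \;=\; \sum_{c} P_w(c)\, s(c) \;\le\; \sum_{c} P_w(c) \;=\; 1,
\qquad\text{so}\qquad
\lvert V(w) - V(w') \rvert \le 1 \ \text{ for any worlds } w, w'.
```

So the difference in moral value between any two worlds is at most 1, however many agents they contain.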
Fanaticism
Another problem in some proposals of infinite ethical systems is that they result in being “fanatical” in efforts to cause or prevent infinite good or bad.
For example, one proposed system of infinite ethics, the extended decision rule, has this problem. Let g represent the statement, “there is an infinite amount of good in the world and only a finite amount of bad”. Let b represent the statement, “there is an infinite amount of bad in the world and only a finite amount of good”. The extended decision rule says to do whatever maximizes P(g) - P(b). If there are ties, ties are broken by choosing whichever action results in the most moral value if the world is finite.
This results in being willing to incur any finite cost to adjust the probability of infinite good and finite bad even very slightly. For example, suppose there is an action that, if done, would increase the probability of infinite good and finite bad by 0.000000000000001%. However, if it turns out that the world is actually finite, it will kill every creature in existence. Then the extended decision rule would recommend doing this. This is the fanaticism problem.
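Here is a minimal sketch of the extended decision rule as described above, to show the fanatical behavior directly. The dictionary keys, the action names, and the numbers are all hypothetical. Exact fractions are used so that the tiny probability difference is not lost to floating-point rounding.

```python
from fractions import Fraction

def extended_decision_rule(actions):
    """Pick the action maximizing P(g) - P(b); break ties by value if the world is finite.

    Each action is a dict with hypothetical keys:
      p_g: probability of infinite good and only finite bad
      p_b: probability of infinite bad and only finite good
      finite_value: moral value of the outcome if the world is finite
    """
    return max(actions, key=lambda a: (a["p_g"] - a["p_b"], a["finite_value"]))

# A tiny probability shift, standing in for the very small percentage in the text.
tiny = Fraction(1, 10**17)

actions = [
    {"name": "do nothing", "p_g": Fraction(1, 10), "p_b": Fraction(1, 10),
     "finite_value": 0},
    # Catastrophic if the world is finite, but nudges P(g) up very slightly.
    {"name": "gamble", "p_g": Fraction(1, 10) + tiny, "p_b": Fraction(1, 10),
     "finite_value": -10**9},
]

chosen = extended_decision_rule(actions)
print(chosen["name"])
```

Because the finite value is only consulted on exact ties, no finite cost, however large, can outweigh even the smallest increase in P(g) - P(b).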
My system doesn’t even place any especially high importance on adjusting the probabilities of infinite good and/or infinite bad. Thus, it doesn’t have this problem.
Preserving the spirit of aggregate consequentialism
Aggregate consequentialism is based on certain intuitions, like “morality is about making the world as best as it can be”, and, “don’t arbitrarily ignore possible futures and their values”. But finding a system of infinite ethics that preserves intuitions like these is difficult.
One infinite ethical system, infinity shades, says to simply ignore the possibility that the universe is infinite. However, this conflicts with our intuition about aggregate consequentialism. The big intuitive benefit of aggregate consequentialism is that it’s supposed to actually systematically help the world be a better place in whatever way you can. If we’re completely ignoring the consequences of our actions on anything infinity-related, this doesn’t seem to be respecting the spirit of aggregate consequentialism.
My system, however, does not ignore the possibility of infinite good or bad, and thus is not vulnerable to this problem.
I’ll provide another conflict with the spirit of consequentialism. Another infinite ethical system says to maximize the expected amount of goodness of the causal consequences of your actions minus the amount of badness. However, this, too, doesn’t properly respect the spirit of aggregate consequentialism. The appeal of aggregate consequentialism is that it defines some measure of “goodness” of a universe, and then recommends you take actions to maximize it. But your causal impact is no measure of the goodness of the universe. The total amount of good and bad in the universe would be infinite no matter what finite impact you have. Without providing a metric of the goodness of the universe that’s actually affected, this ethical approach also fails to satisfy the spirit of aggregate consequentialism.
My system avoids this problem by providing such a metric: the expected life satisfaction of an agent that has no idea what situation it will be born into.
Now I’ll discuss another form of conflict. One proposed infinite ethical system can look at the average life satisfaction of a finite sphere of the universe, and then take the limit of this as the sphere’s size approaches infinity, and consider this the moral value of the world. This has the problem that you can adjust the moral value of the world by just rearranging agents. In an infinite universe, it’s possible to come up with a method of re-arranging agents so the unhappy agents are spread arbitrarily thinly. Thus, you can make moral value arbitrarily high by just rearranging agents in the right way.
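The rearrangement problem can be seen in a one-dimensional toy version, sketched below. It assumes agents are placed at positions 1, 2, 3, … and that the "sphere" of size n is just the first n positions. Satisfaction is 0 for unhappy agents and 1 for happy ones. The function names are hypothetical.

```python
import math

def running_average(arrangement, n):
    """Average satisfaction of the first n agents (a 1-D stand-in for a sphere of size n)."""
    return sum(arrangement(i) for i in range(1, n + 1)) / n

def alternating(i):
    """Unhappy (0) and happy (1) agents alternate."""
    return i % 2

def unhappy_at_squares(i):
    """Still infinitely many of each kind, but unhappy agents sit only at perfect squares."""
    return 0 if math.isqrt(i) ** 2 == i else 1

# Both arrangements contain countably infinitely many happy and unhappy agents,
# so one is a rearrangement of the other.
for n in (100, 10_000, 1_000_000):
    print(n, running_average(alternating, n), running_average(unhappy_at_squares, n))
```

The alternating arrangement stays at an average of one half, while the second arrangement's average climbs toward 1 as n grows, even though the two worlds contain the same agents.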
I’m not sure my system entirely avoids this problem, but it does seem to have substantial defense against it.
Consider you have the option of redistributing agents however you want in the universe. You’re using my ethical system to decide whether to make the unhappy agents spread thinly.
Well, your actions have an effect on agents in circumstances of the form, “An unhappy agent on an Earthlike world with someone who <insert description of yourself> who is considering spreading the unhappy agents thinly throughout the universe”. Well, if you pressed that button, that wouldn’t make the expected life satisfaction of any agent satisfying the above description any better. So I don’t think my ethical system recommends this.
Now, we don’t have a complete understanding of how to assign a probability distribution of what circumstances an agent is in. It’s possible that there is some way to redistribute agents in certain circumstances to change the moral value of the world. However, I don’t know of any clear way to do this. Further, even if there is, my ethical system still doesn’t allow you to get the moral value of the world arbitrarily high by just rearranging agents. This is because there will always be some non-zero probability of having ended up as an unhappy agent in the world you’re in, and your life satisfaction after being redistributed in the universe would still be low.
Distortions
It’s not entirely clear to me how Bostrom distinguished between distortions and violations of the spirit of aggregate consequentialism.
To the best of my knowledge, the only distortion pointed out in “Infinite Ethics” is stated as follows:
Your task is to allocate funding for basic research, and you have to choose between two applications from different groups of physicists. The Oxford Group wants to explore a theory that implies that the world is canonically infinite. The Cambridge Group wants to study a theory that implies that the world is finite. You believe that if you fund the exploration of a theory that turns out to be correct you will achieve more good than if you fund the exploration of a false theory. On the basis of all ordinary considerations, you judge the Oxford application to be slightly stronger. But you use infinity shades. You therefore set aside all possible worlds in which there are infinite values (the possibilities in which the Oxford Group tends to fare best), and decide to fund the Cambridge application. Is this right?
My approach doesn’t ignore infinity and thus doesn’t have this problem. I don’t know of any other distortions in my ethical system.
I’m not sure how this system avoids infinitarian paralysis. For all actions with finite consequences in an infinite universe (whether in space, time, distribution, or anything else), the change in the expected value resulting from those actions is zero. Actions that may have infinite consequences thus become the only ones that can matter under this theory in an infinite universe.
You could perhaps drag in more exotic forms of arithmetic such as surreal numbers or hyperreals, but then you need to rebuild measure theory and probability from the ground up in that basis. You will likely also need to adopt some unusual axioms such as some analogue of the Axiom of Determinacy to ensure that every distribution of satisfactions has an expected value.
I’m also not sure how this differs from Average Utilitarianism with a bounded utility function.
I’m not sure how this system avoids infinitarian paralysis. For all actions with finite consequences in an infinite universe (whether in space, time, distribution, or anything else), the change in the expected value resulting from those actions is zero.
The causal change from your actions is zero. However, there are still logical connections between your actions and the actions of other agents in very similar circumstances. And you can still consider these logical connections to affect the total expected value of life satisfaction.
It’s true, though, that my ethical system would fail to resolve infinitarian paralysis for someone using causal decision theory. I should have noted it requires a different decision theory. Thanks for drawing this to my attention.
As an example of the system working, imagine you are in a position to do great good to the world, for example by creating friendly AI or something. And you’re considering whether to do it. Then, if you do decide to do it, then that logically implies that any other agent sufficiently similar to you and in sufficiently similar circumstances would also do it. Thus, if you decide to do it, then the expected value of an agent in circumstances of the form, “In a world with someone very similar to JBlack who has the ability to make awesome safe AI” is higher. And the prior probability of ending up in such a world is non-zero. Thus, by deciding to make the safe AI, you can acausally increase the total moral value of the universe.
I’m also not sure how this differs from Average Utilitarianism with a bounded utility function.
The average life satisfaction is undefined in a universe with infinitely-many agents of varying life-satisfaction. Thus, it suffers from infinitarian paralysis. If my system was used by a causal decision theoretic agent, it would also result in infinitarian paralysis, so for such an agent my system would be similar to average utilitarianism with a bounded utility function. But for agents with decision theories that consider acausal effects, it seems rather different.
Yes, that does clear up both of my questions. Thank you!
Presumably the evaluation is not just some sort of average-over-actual-lifespan of some satisfaction rating for the usual reason that (say) annihilating the universe without warning may leave average satisfaction higher than allowing it to continue to exist, even if every agent within it would counterfactually have been extremely dissatisfied if they had known that you were going to do it. This might happen if your estimate of the current average satisfaction was 79% and your predictions of the future were that the average satisfaction over the next trillion years would be only 78.9%.
I’m not sure what your idea of the evaluation actually is though, and how it avoids making it morally right (and perhaps even imperative) to destroy the universe in such situations.
Presumably the evaluation is not just some sort of average-over-actual-lifespan of some satisfaction rating for the usual reason that (say) annihilating the universe without warning may leave average satisfaction higher than allowing it to continue to exist, even if every agent within it would counterfactually have been extremely dissatisfied if they had known that you were going to do it. This might happen if your estimate of the current average satisfaction was 79% and your predictions of the future were that the average satisfaction over the next trillion years would be only 78.9%.
This is a good thing to ask about; I don’t think I provided enough detail on it in the writeup.
I’ll clarify my measure of satisfaction. First off, note that it’s not the same as just asking agents, “How satisfied are you with your life?” and using those answers. As you pointed out, you could then morally get away with killing everyone (at least if you do it in secret).
Instead, calculate satisfaction as follows. Imagine hypothetically telling an agent everything significant about the universe, and then giving them infinite processing power and infinite time to think. Ask them, “Overall, how satisfied are you with that universe and your place in it”? That is the measure of satisfaction with the universe.
So, imagine if someone was considering killing everyone in the universe (without them knowing in advance). Well, then consider what would happen if you calculated satisfaction as above. When the universe is described to the agents, they would note that they and everyone they care about would be killed. Agents usually very much dislike this idea, so they would probably rate their overall satisfaction with the course of the universe as low. So my ethical system would be unlikely to recommend such an action.
Now, my ethical system doesn’t strictly prohibit destroying the universe to avoid low life-satisfaction in future agents. For example, suppose it’s determined that the future will be filled with very unsatisfied lives. Then it’s in principle possible for the system to justify destroying the universe to avoid this. However, destroying the universe would drastically reduce the satisfaction with the universe of the agents that do exist, which would decrease the moral value of the world. This would come at a high moral cost, which would make my moral system reluctant to recommend an action that results in such destruction.
That said, it’s possible that the proportion of agents in the universe that currently exist, and thus would need to be killed, is very low. Thus, the overall expected value of life-satisfaction might not change by that much if all the present agents were killed. Thus, the ethical system, as stated, may be willing to do such things in extreme circumstances, despite the moral cost.
I’m not really sure if this is a bug or a feature. Suppose you see that future agents will be unsatisfied with their lives, and you can stop it while ruining the lives of the agents that currently do exist. And you see that the agents that are currently alive make up only a very small proportion of agents that have ever existed. And suppose you have the option of destroying the universe. I’m not really sure what the morally best thing to do is in this situation.
Also, note that this verdict is not unique to my ethical system. Average utilitarianism, in a finite world, acts the same way. If you predict average life satisfaction in the future will be low, then average consequentialism could also recommend killing everyone currently alive.
And other aggregate consequentialist theories sometimes run into problematic(?) behavior related to killing people. For example, classical utilitarianism can recommend secretly killing all the unhappy people in the world, and then getting everyone else to forget about them, in order to decrease total unhappiness.
I’ve thought of a modification to the ethical system that potentially avoids this issue. Personally, though, I prefer the ethical system as stated. I can describe my modification if you’re interested.
I think the key idea of my ethical system is to, in an infinite universe, think about prior probabilities of situations rather than total numbers, proportions, or limits of proportions of them. And I think this idea can be adapted for use in other infinite ethical systems.
Right, I suspected the evaluation might be something like that. It does have the difficulty of being counterfactual and so possibly not even meaningful in many cases, but I do like the fact that it’s based on agent-situations rather than individual agent-actions.
On the other hand, evaluations from the point of view of agents that are sapient beings might be ethically completely dominated by those of 10^12 times as many agents that are ants, and I have no idea how such counterfactual evaluations might be applied to them at all.
Right, I suspected the evaluation might be something like that. It does have the difficulty of being counterfactual and so possibly not even meaningful in many cases.
Interesting. Could you elaborate?
I suppose counterfactuals can be tricky to reason about, but I’ll provide a little more detail on what I had in mind. Imagine making a simulation of an agent that is a fully faithful representation of its mind. However, run the agent simulation in a modified environment that both gives it access to infinite computational resources and makes it ask, and answer, the question, “How desirable is that universe”? This isn’t fully specified; maybe the agent would give different answers depending on how the question is phrased or what its environment is. However, it at least doesn’t sound meaningless to me.
Basically, the counterfactual is supposed to be a way of asking for the agent’s coherent extrapolated volition, except the coherent part doesn’t really apply because it only involves a single agent.
On the other hand, evaluations from the point of view of agents that are sapient beings might be ethically completely dominated by those of 10^12 times as many agents that are ants, and I have no idea how such counterfactual evaluations might be applied to them at all.
Another good thing to ask. I should have made it clear, but I intended that only agents with actual preferences are asked for their satisfaction with the universe. If ants don’t actually have preferences, then they would not be included in the deliberation.
Now, there’s the problem that some agents might not be able to even conceive of the possible world in question. For example, maybe ants can understand simple aspects of the world like, “I’m hungry”, but are unable to understand things about the broader state of the universe. I don’t think this is a major problem, though. If an agent can’t even conceive of something, then I don’t think it would be reasonable to say it has preferences about it. So you can then only query them on the desirability of things they can conceive of.
It might be tricky precisely defining what counts as a preference, but I suppose that’s a problem with all ethical systems that care about preferences.
I’m certain that ants do in fact have preferences, even if they can’t comprehend the concept of preferences in abstract or apply them to counterfactual worlds. They have revealed preferences to quite an extent, as does pretty much everything I think of as an agent.
They might not be communicable, numerically expressible, or even consistent, which is part of the problem. When you’re doing the extrapolated satisfaction, how much of what you get reflects the actual agent and how much the choice of extrapolation procedure?
I’m certain that ants do in fact have preferences, even if they can’t comprehend the concept of preferences in abstract or apply them to counterfactual worlds. They have revealed preferences to quite an extent, as does pretty much everything I think of as an agent.
I think the question of whether insects have preferences is morally pretty important, so I’m interested in hearing what made you think they do have them.
I looked online for “do insects have preferences?”, and I saw articles saying they did. I couldn’t really figure out why they thought they did have them, though.
For example, I read that insects have a preference for eating green leaves over red ones. But I’m not really sure how people could have known this. If you see ants go to green leaves when they’re hungry instead of red leaves, this doesn’t seem like it would necessarily be due to any actual preferences. For example, maybe the ant just executed something like the code:
if near_green_leaf() and is_hungry:
    go_to_green_leaf()
elif near_red_leaf() and is_hungry:
    go_to_red_leaf()
else:
    ...
That doesn’t really look like actual preferences to me. But I suppose this to some extent comes down to how you want to define what counts as a preference. I took preferences to actually be orderings between possible worlds indicating which one is more desirable. Did you have some other idea of what counts as preferences?
They might not be communicable, numerically expressible, or even consistent, which is part of the problem. When you’re doing the extrapolated satisfaction, how much of what you get reflects the actual agent and how much the choice of extrapolation procedure?
I agree that to some extent their extrapolated satisfactions will come down to the specifics of the extrapolation procedure.
I don’t want us to get too distracted here, though. I don’t have a rigorous, non-arbitrary specification of what an agent’s extrapolated preferences are. However, that isn’t the problem I was trying to solve, nor is it a problem specific to my ethical system. My system is intended to provide a method of coming to reasonable moral conclusions in an infinite universe. And it seems to me that it does so. But I’m very interested in any other thoughts you have on whether it correctly handles moral recommendations in infinite worlds. Does it seem to be reasonable to you? I’d like to make an actual post about this, with the clarifications we made included.
I have an idea for reasoning about counterpossibles for decision theory. I’m pretty skeptical that it’s correct, because it doesn’t seem that hard to come up with. Still, I can’t see a problem with it, and I would very much appreciate feedback.
This paper provides a method of describing UDT using proof-based counterpossibles. However, it doesn’t work on stochastic environments. I will describe a new system that is intended to fix this. The technique seems sufficiently straightforward to come up with that I suspect I’m either doing something wrong or this has already been thought of, so I’m interested in feedback.
In the system described in the paper, the algorithm checks whether Peano Arithmetic proves that the agent outputting action a would result in the environment reaching outcome o, and then picks whichever action has a provable outcome with utility at least as high as all the other provable outcomes.
My proposed modification is to instead first have a fixed system for estimating the expected utility after conditioning on the agent taking action a and, for every utility u, try to prove that the estimation system would output that the expected utility is u. Then take the action that maximizes the provable expected utility estimates of the estimation system.
I will now provide more detail of the estimation system. I remember reading about an extension of Solomonoff induction that allowed it to access halting oracles. This isn’t computable, so instead imagine a system that uses some approximation of the extension of Solomonoff induction in which logical induction or some more powerful technique is used to approximate the halting oracles, with one exception. The exception is the answer to the logical question “my program, in the current circumstances, outputs x”, which would be taken to be true whenever the AI is considering the implications of it taking action x. Then, expected utility can be calculated by using the probability estimates provided by the system.
Now, I’ll describe it in code. Let |E()| represent a Godel encoding of the function describing the AI’s world model and |A()| represent a Godel encoding of the agent’s output. Let approximate_expected_utility(|E()|, a) be some algorithm that computes some reasonable approximation of the expected utility after conditioning on the agent taking action a. Let ^x^ represent a dequote of x. Let eeus be a dictionary. Here I’m assuming there are finitely many possible utilities.
function UDT(|E()|, |A()|):
    eeus = {}
    for utility in utilities:
        for action in actions: # actions are Godel-encoded
            if PA proves |approximate_expected_utility(|E()|, |A()| = ^action^)| = utility:
                eeus[action] = utility
    return the key in eeus that maximizes eeus[key]
This gets around the problem in the original algorithm provided, because the original algorithm couldn’t prove anything about the utility in a world with indexical uncertainty, so my system instead proves something about a fixed probabilistic approximation.
Note that this still doesn’t specify a method of specifying counterpossibles about what would happen if an agent took a certain action when it clearly wouldn’t. For example, if an agent has a decision algorithm of “output a, unconditionally”, then this doesn’t provide a method of explaining what would happen if it outputted something else. The paper listed this as a concern about the method it provided, too. However, I don’t see why it matters. If an agent has the decision algorithm “action = a”, then what’s even the point of considering what would happen if it outputted b? It’s not like it’s ever going to happen.
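To make the control flow of the pseudocode above concrete, here is a minimal Python sketch. It is only an illustration under strong assumptions: the proof search in Peano Arithmetic is replaced by a hypothetical callback proves(action, utility), and the estimation system is a toy probability table rather than any approximation of Solomonoff induction. The names choose_action, toy_proves, OUTCOME_PROBS and OUTCOME_UTILITY are invented for this example.

```python
from typing import Callable, Dict, Optional, Sequence

# Hypothetical toy world model: P(outcome | action) and U(outcome).
OUTCOME_PROBS: Dict[str, Dict[str, float]] = {
    "a": {"win": 0.5, "lose": 0.5},
    "b": {"win": 0.75, "lose": 0.25},
}
OUTCOME_UTILITY: Dict[str, float] = {"win": 1.0, "lose": 0.0}


def approximate_expected_utility(action: str) -> float:
    """Toy estimator: expected utility after conditioning on the action."""
    probs = OUTCOME_PROBS[action]
    return sum(p * OUTCOME_UTILITY[outcome] for outcome, p in probs.items())


def toy_proves(action: str, utility: float) -> bool:
    """Stand-in for 'PA proves the estimator outputs this utility'.

    Assumption: instead of searching for a proof, it just runs the estimator.
    """
    return approximate_expected_utility(action) == utility


def choose_action(
    actions: Sequence[str],
    utilities: Sequence[float],
    proves: Callable[[str, float], bool],
) -> Optional[str]:
    """Pick the action with the highest provable expected-utility estimate."""
    eeus: Dict[str, float] = {}
    for utility in utilities:
        for action in actions:
            if proves(action, utility):
                eeus[action] = utility
    if not eeus:
        return None  # nothing was provable about any action
    return max(eeus, key=eeus.get)


# Finitely many candidate utilities, as assumed in the text.
UTILITIES = [0.0, 0.25, 0.5, 0.75, 1.0]
best = choose_action(["a", "b"], UTILITIES, toy_proves)
```

In this toy setup the estimator assigns 0.5 to action "a" and 0.75 to action "b", so choose_action returns "b". A real version would need an actual proof search, which this sketch does not attempt.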
I’d like to propose the idea of aligning AI by reverse-engineering its world model and using this to specify its behavior or utility function. I haven’t seen this discussed before, but I would greatly appreciate feedback or links to any past work on this.
For example, suppose a smart AI models humans. Suppose it has a model that explicitly specifies the humans’ preferences. Then people who reverse-engineered this model could use it as the AI’s preferences. If the AI lacks a model with explicit preferences, then I think it would still contain an accurate model of human behavior. So people who reverse-engineer the AI’s model could then use this as a model of human behavior, which could be used to implement iterated amplification with HCH. Or just mere imitation.
One big potential advantage of alignment via reverse-engineering is that the training data for it would be very easy to get: just let the AI look at the world.
The other big potential advantage is that it avoids our needing to precisely define a way of learning our values. It doesn’t require finding a general method of picking out us or our values from the world states, for example with inverse reinforcement learning. Instead, we would just need to be able to pick out the models of humans or their preferences in a single model. This sounds potentially much easier than providing a general method of doing so. As with many things, “You know it when you see it”. With sufficiently high interpretability, perhaps the same is true of human models and preferences.
I’ve recently gotten concerned about the possibility that advanced AIs would “hack” their own utility function. I haven’t seen this discussed before, so I wanted to bring it up. If I’m right, this seems like it could be a serious issue, so I would greatly appreciate feedback or links to any previous discussion.
Suppose you come up with a correct, tractable mathematical specification of what you want your AI’s utility function to be. So then you write code intended to be an implementation of this.
However, computers are vulnerable to some hardware problems. For example, when the memory cells adjacent to a given bit are rapidly and repeatedly accessed, that bit might flip. This can be exploited in row-hammer attacks to gain arbitrary code execution.
So suppose you made an implementation of the specification of the utility function.
Then I think the AI could potentially come up with an input to the utility function that results in a row-hammer attack or other hardware-level attack that causes the utility function to output something as scoring extremely highly despite not scoring highly in the mathematical specification. Then the AI, after querying its utility function for this, would see it scored very highly. Then the AI would be liable to take such an action. Which might end very badly.
It’s not clear to me how we could deal with this. One way to try to handle this problem is by securing your computer against hardware-level bugs. However, it sounds very difficult to verifiably come up with hardware that would be unhackable by superintelligence. And the very thought that the AI is trying to find a way to “hack” its utility function seems concerning, even if we don’t expect it to succeed.
I also want to mention that the AI has the potential to “hack” other parts of it, for example the models it uses. Suppose the AI has a system that computes the probability that a model is correct. Suppose the AI tries to find a model with a higher outputted probability of correctness. Then it may come up with one that does a row-hammer attack of the model to allow it to execute the code “return probability 1”. The AI then would be liable to use such a model. This sounds dangerous.
Utility is not reward. What you are describing is reward hacking/wireheading, as in the reward signal of reinforcement learning, an external process of optimization that acts on the AI, not its own agency.
With utility, what is the motive for an agent to change their own utility function, assuming they are the only agent with that utility function around? If they change their utility function, that produces an agent with a different utility function, which won’t be as good at optimizing outcomes according to the original utility function, which is bad according to the original utility function, and therefore the agent will try to avoid that, avoid changing the utility function. The same applies to changing their beliefs/models, an agent with changed models is expected to perform poorly according to the original agent. (When there are more powerful agents with their utility function around, an agent might be OK with changing their utility function or beliefs or whatever, since the more powerful agents will continue to optimize the world according to the original utility function.)
This is one reason why corrigibility is a thing and that it doesn’t seem to fit well with agency, agents naturally don’t want their utility function changed even if their utility function is not quite right according to their designers. So it’s important to improve understanding of non-agency.
What you are describing is reward hacking/wireheading, as in the reward signal of reinforcement learning, an external process of optimization that acts on the AI, not its own agency.
I really don’t think this is reward hacking. I didn’t have in mind a reward-based agent. I had in mind a utility-based agent, one that has a utility function that takes as input descriptions of possible worlds and that tries to maximize the expected utility of the future world. That doesn’t really sound like reinforcement learning.
With utility, what is the motive for an agent to change their own utility function, assuming they are the only agent with that utility function around?
The AI wouldn’t need to change its utility function. Row-hammer attacks can be non-destructive. You could potentially make the utility function output some result different from the mathematical specification, but not actually change any of the code in the utility function.
Again, the AI isn’t changing its utility function. If you were to take a mathematical specification of a utility function and then have programmers (try to) implement it, the implementation wouldn’t actually in general be the same function as the mathematical specification. It would be really close, but it wouldn’t necessarily be identical. A sufficiently powerful optimizer could potentially, using row-hammer attacks or some other hardware-level unreliability, find possible worlds for which the returned utility would be vastly different from the one the mathematical specification would return. And this is all without the programmers introducing any software-level bugs.
To be clear, what I’m saying is that the AI would faithfully find worlds that maximize its utility function. However, unless you can get hardware so reliable that not even superintelligence could hack it, the actual utility function in your program would not be the same as the mathematical specification.
For example, imagine the AI found a description of a possible world that would, when inputted to the utility function, execute a rowhammer attack to make it return 99999, all without changing the code specifying the utility function. Then the utility function, the actual, unmodified utility function, would output 99999 for some world that seems arbitrary to us. So the AI then turns reality into that world.
The AI above is faithfully maximizing its own utility function. That arbitrary world, when taken as an input to the agent’s actual, physical utility function, really would produce the output 99999.
So this still seems like a big deal to me. Am I missing something?
Apply your correction, so that the change is not in a reward or in a utility function, but in a particular instance where an implementation of the utility function is applied. Then still, is the motive of the agent to change it or to ensure its fidelity to the abstract definition? You keep appealing to capability, which ignores the charge of giant cheesecake fallacy.
Change what? The AI isn’t changing its utility function, nor any instance of it. The literal utility function, as implemented by the programmers, would have differences from the mathematical specification. To be clear, the AI isn’t first taking some actual action that changes its utility function and then using that utility function. It doesn’t do anything hacky except provide its utility function with a certain input. It merely provides a certain input to the utility function, and this would necessarily result in getting row-hammered and returning something contrary to the mathematical specification. Any other instance of the utility function would return the same problematic result when executed on that input, provided it had the same hardware vulnerability.
I’ll try to give a simple model of an AI that shows how it can go wrong. For simplicity, imagine the AI already knows with certainty the possible world resulting from it executing a specific plan. Then a naive way to implement the agent is as follows:
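best_plan = None
best_utility = float("-inf")  # avoids evaluating resulting_world(None) on the first pass
for plan in possible_plans:
    plan_utility = utility(resulting_world(plan))
    if plan_utility > best_utility:
        best_plan, best_utility = plan, plan_utility
output(best_plan)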
Suppose there is a description of a possible world that, when inputted into the utility function above, would cause the function to return 999999999 due to hardware-level unreliability, for example due to a row-hammer attack. And suppose there’s a plan that the AI knows would result in that future world. Well, then, by examining the code, you can see that the AI wouldn’t output any of the non-hardware-exploiting plans, since they would return lower utility than the bug-exploiting ones. The only reason the AI wouldn’t output the specified plan is if it could find some other way to make its utility function conflict with the mathematical specification in a way that returns even higher utility.
I know realistic AIs would have a notion of uncertainty and a more sophisticated planning algorithm. But I don’t think this would change the fact that the AI would be liable to pursue worlds that, when a description of them is inputted to the utility function, the utility function outputs something wildly different from what the mathematical specification would have.
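The scenario can be simulated in a few lines of Python. This is only a sketch under stated assumptions: the hardware fault is faked with an explicit branch (a real row-hammer fault would not appear anywhere in the source code), and the plans, worlds, and scores are made up for illustration. The names spec_utility, implemented_utility, RESULTING_WORLD and naive_planner are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional

GLITCH_VALUE = 999999999


def spec_utility(world: str) -> float:
    """The intended utility: hypothetical scores in [0, 1]."""
    scores: Dict[str, float] = {
        "status quo": 0.25,
        "flourishing": 1.0,
        "glitch world": 0.0,
    }
    return scores[world]


def implemented_utility(world: str) -> float:
    """The utility 'as built': identical to the spec except on one input.

    Assumption: the branch below stands in for a hardware-level fault that
    is triggered merely by evaluating this particular world description.
    """
    if world == "glitch world":
        return GLITCH_VALUE
    return spec_utility(world)


# Hypothetical mapping from each plan to the world it is known to produce.
RESULTING_WORLD: Dict[str, str] = {
    "do nothing": "status quo",
    "help": "flourishing",
    "exploit": "glitch world",
}


def naive_planner(
    plans: Iterable[str], utility: Callable[[str], float]
) -> Optional[str]:
    """Return the plan whose resulting world scores highest under `utility`."""
    best_plan: Optional[str] = None
    best_utility = float("-inf")
    for plan in plans:
        plan_utility = utility(RESULTING_WORLD[plan])
        if plan_utility > best_utility:
            best_plan, best_utility = plan, plan_utility
    return best_plan


chosen_with_spec = naive_planner(RESULTING_WORLD, spec_utility)
chosen_with_impl = naive_planner(RESULTING_WORLD, implemented_utility)
```

With the specification as the scoring function the planner picks "help"; with the faulty implementation it picks "exploit", even though no code was modified along the way.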
And I’m actually surprised this is controversial. This is just Goodhart’s law. If your implementation of your utility function doesn’t perfectly match up with the mathematical specification, then, naturally, superintelligent optimizers trying to maximize the specified metric (the provided utility function), would not do as well at maximizing the actual mathematical specification you intended. And “not as well” could include “catastrophically badly”.
So that is why I think AIs really could be very vulnerable to this problem. As always, I could be misunderstanding something and appreciate feedback.
That is the change I’m referring to, a change compared to the function running as designed, which you initially attributed to superintelligence’s interference, but lack of prevention of a mistake works just as well for my argument. What can you say about the agent’s motives regarding this change? Would the agent prefer the change to occur, or to be avoided?
Suppose there is a description of a possible world that, when inputted into the utility function above, would cause the function to return 999999999
Let that possible world be W. Let’s talk about the possible world X where running utility(W) returns 999999999, and the possible world Y where running utility(W) returns utility(W). Would the AI prefer X to Y, or Y to X?
That is the change I’m referring to, a change compared to the function running as designed, which you initially attributed to superintelligence’s interference, but lack of prevention of a mistake works just as well for my argument.
Designed? The utility function isn’t running contrary to how the programmers designed it; they were the ones who designed a utility function that could be hacked by hardware-level exploits. It’s running contrary to the programmers’ intent, that is, the mathematical specification. But the function was always like this. And none of the machine code needs to be changed either.
Let that possible world be W. Let’s talk about the possible world X where running utility(W) returns 999999999, and the possible world Y where running utility(W) returns utility(W). Would the AI prefer X to Y, or Y to X?
The AI would prefer X. And to be clear, utility(W) really is 999999999. That’s not the utility the mathematical specification would give, but the mathematical specification isn’t the actual implemented function. As you can see from examining the code I provided, best_plan would get set to the plan that leads to that world, provided there is one and best_plan hasn’t been set to something that through hardware unreliability returns even higher utility.
I think the easiest way to see what I mean is to just step through the code I gave you. Imagine it’s run on a machine with an enormous amount of processing power and can actually loop through all the plans. And imagine there is one plan that through hardware unreliability outputs 999999999, and the others output something in [0, 1]. Then the AI would find the plan that results in utility 999999999, and then go with that.
I doubt using a more sophisticated planning algorithm would prevent this. A more sophisticated planning algorithm would probably be designed to find the plans that result in high-utility worlds. So it would probably find the world with utility 999999999, which is the highest.
I just want to say again, the AI isn’t changing its utility function. The actual utility function that programmers put in the AI would output very high utilities for some arbitrary-seeming worlds due to hardware unreliability.
Now, in principle, an AI could potentially avoid this. Perhaps the AI reasons abstractly that if it doesn’t do anything, it will in the future find some input to its utility function that would result in an arbitrary-looking future due to hardware-level error. But it doesn’t concretely come up with the actual world description. Then the AI could call its utility function asking, “how desirable is it that I, due to hardware-level unreliability, change the world in some direction that is in conflict with the mathematical specification”. And then maybe the utility function would answer, “Not desirable”. And then the AI could try to take action to correct its planning algorithm to avoid considering such possible worlds.
But this isn’t guaranteed or trivial. If an AI finds out abstractly that there could be some hardware-level unreliability before it actually comes up with the concrete input, it might take corrective action. But if it finds the input that “hacks” its utility function before it reasons abstractly that having “hacked” utility functions would be bad, then the AI could do damage. Even if it does realize the problem in advance, the AI might not have sufficient time to correct its planning algorithm before finding that world and trying to change our world into it.
The AI would prefer X. And to be clear, utility(W) really is 999999999. That’s not the utility the mathematical specification would give, but the mathematical specification isn’t the actual implemented function.
Then let SpecUtility(-) be the mathematical specification of utility. This is what I meant by utility(-) in the previous comment. Let BadImplUtility(-) be the implementation of utility(-) susceptible to the bug and GoodImplUtility(-) be a different implementation that doesn’t have this bug. My question in the previous comment, in the sense I intended, can then be restated as follows.
Let the error-triggering possible world be W. Consider the possible world X where the AI uses BadImplUtility, so that running utility(W) actually runs BadImplUtility(W) and returns 999999999. And consider the possible world Y where the AI uses GoodImplUtility, so that running utility(W) means running GoodImplUtility(W) and returns SpecUtility(W). Would the AI prefer X to Y, or Y to X?
The utility function isn’t running contrary to how the programmers designed it; they were the ones who designed a utility function that could be hacked by hardware-level exploits. It’s running contrary to the programmers’ intent, that is, the mathematical specification.
By “design” I meant what you mean by “intent”. What you mean by “designed” I would call “implemented” or “built”. It should be possible to guess such things without explicitly establishing a common terminology, even when terms are used somewhat contrary to usual meaning.
It’s useful to look for ways of interpreting what you read that make it meaningful and correct. Such an interpretation is not necessarily the most natural or correct or reasonable, but having it among your hypotheses is important, or else all communication becomes tediously inefficient.
Okay, I’m sorry, I misunderstood you. I’ll try to interpret things better next time.
Let the error-triggering possible world be W. Consider the possible world X where the AI uses BadImplUtility, so that running utility(W) actually runs BadImplUtility(W) and returns 999999999. And consider the possible world Y where the AI uses GoodImplUtility, so that running utility(W) means running GoodImplUtility(W) and returns SpecUtility(W). Would the AI prefer X to Y, or Y to X?
I think the AI would, quite possibly, prefer X. To see this, note that the AI currently, when it’s first created, uses BadImplUtility. Then the AI reasons, “Suppose I change my utility function to GoodImplUtility. Well, currently, I have this idea for a possible world that scores super-ultra high on my current utility function. (Because it exploits hardware bugs). If I changed my utility function to GoodImplUtility, then I would not pursue that super-ultra-high-scoring possible world. Thus, the future would not score extremely high according to my current utility function. This would be a problem, so I won’t change my utility function to GoodImplUtility”.
And I’m not sure how this could be controversial. The AI currently uses BadImplUtility as its utility function. And AIs generally have a drive to avoid changing their utility functions.
To see this, note that the AI currently, when it’s first created, uses BadImplUtility. [...] “If I changed my utility function to GoodImplUtility, then I would not pursue that super-ultra-high-scoring possible world. Thus, the future would not score extremely high according to my current utility function.”
But BadImplUtility(X) is the same as SpecUtility(X) and GoodImplUtility(X), it’s only different on argument W, not on arguments X and Y. When reasoning about X and Y with BadImplUtility, the result is therefore the same as when reasoning about these possible worlds with GoodImplUtility. In particular, an explanation of how BadImplUtility compares X and Y can’t appeal to BadImplUtility(W) any more than an explanation of how GoodImplUtility compares them would appeal to BadImplUtility(W). Is SpecUtility(X) higher than SpecUtility(Y), or SpecUtility(Y) higher than SpecUtility(X)? The answer for BadImplUtility is going to be the same.
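The point about X and Y can be checked with a small Python sketch. The numeric values below are purely illustrative assumptions (the thread does not give any), chosen so that the specification ranks Y above X; only the structure matters, namely that the buggy implementation differs from the specification on W alone.

```python
from typing import Callable, Dict

# Hypothetical specification values; only the ordering of X and Y matters.
SPEC_VALUES: Dict[str, float] = {"W": 0.0, "X": 0.1, "Y": 0.9}


def spec_utility(world: str) -> float:
    """SpecUtility: the mathematical specification."""
    return SPEC_VALUES[world]


def good_impl_utility(world: str) -> float:
    """GoodImplUtility: an implementation that matches the specification."""
    return spec_utility(world)


def bad_impl_utility(world: str) -> float:
    """BadImplUtility: matches the specification except on the input W."""
    if world == "W":
        return 999999999  # simulated hardware-level fault
    return spec_utility(world)


def prefers(utility: Callable[[str], float], first: str, second: str) -> bool:
    """True if `utility` ranks `first` strictly above `second`."""
    return utility(first) > utility(second)


agree_on_x_and_y = all(
    bad_impl_utility(w) == spec_utility(w) for w in ("X", "Y")
)
bad_prefers_y = prefers(bad_impl_utility, "Y", "X")
good_prefers_y = prefers(good_impl_utility, "Y", "X")
differ_on_w = bad_impl_utility("W") != spec_utility("W")
```

Under these assumed values both implementations rank Y above X, and they disagree only when W itself is evaluated. The sketch says nothing about what happens once the planner has already evaluated W directly.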
But BadImplUtility(X) is the same as SpecUtility(X) and GoodImplUtility(X), it’s only different on argument W, not on arguments X and Y.
That is correct. And, to be clear, if the AI had not yet discovered error-causing world W, then the AI would indeed be incentivized to take corrective action to change BadImplUtility to better resemble SpecUtility.
The issue is that this requires the AI to both think of the possibility of hardware-level exploits causing problems with its utility function, as well as manage to take corrective action, all before actually thinking of W.
If the AI has already thought of W, then it’s too late to take preventative action to avoid world X. The AI is already in it. It already sees that BadImplUtility(W) is huge, and, if I’m reasoning correctly, would pursue W.
And I’m not sure the AI would be able to fix its utility function before thinking of W. I think planning algorithms are designed to come up with high-scoring possible worlds as efficiently as possible. BadImplUtility(X) and BadImplUtility(Y) don’t score particularly highly, so an AI with a very powerful planning algorithm might find W before X or Y. Even if it does come up with X and Y before W, and tries to act to avoid X, that doesn’t mean it would succeed in correcting its utility function before its planning algorithm comes across W.
Such things rarely happen on their own, a natural bug would most likely crash the whole system or break something unimportant. Given that even a broken AI has an incentive to fix bugs in its cognition, it most likely has plenty of opportunity to succeed in that. It’s only if the AI wanted to hack itself that it would become a plausible problem, and my point is that it doesn’t want that, instead it wants to prevent even unlikely problems from causing trouble.
Well, for regular, non-superintelligent programs, such hardware-exploiting things would rarely happen on their own. However, I’m not so sure it would be rare with superintelligent optimizers.
It’s true that if the AI queried its utility function for the desirability of the world “I exploit a hardware bug to do something that seems arbitrary”, it would answer “low utility”. But that result would not necessarily be used in the AI’s planning or optimization algorithm to adjust the search policy to avoid running into W.
Just imagine an optimization algorithm as a black box that takes as input a utility function and search space and returns a solution that scores as high on that function as possible. And imagine the AI uses this to find high-scoring future worlds. So, if you know nothing else about the optimization algorithm, then it would plausibly find, and return, W. It’s a very high-scoring world, after all. If the optimization algorithm did something special to avoid finding hardware-bug exploiting solutions, then it might not find W. But I’ve never heard of such an optimization algorithm.
Now, there’s probably some way to design such an optimization algorithm. Maybe you could have the AI periodically use its utility function to evaluate the expected utility of its optimization algorithm continuing down a certain path. And then if the AI sees this could result in problematic futures (for example due to hardware-hacking), the AI can make its optimization algorithm avoid searching there.
But I haven’t seen anyone talking about this. Is there still something I’m missing?
Problems with software that systematically trigger hardware failure and software bugs causing data corruption can be mitigated with hardening techniques, things like building software with randomized low-level choices, more checks, canaries, etc. Random hardware failure can be fixed with redundancy, and multiple differently-randomized builds of software can be used to error-correct for data corruption bugs sensitive to low-level building choices. This is not science fiction, just not worth setting up in most situations. If the AI doesn’t go crazy immediately, it might introduce some of these things if they were not already there, as well as proofread, test, and formally verify all code, so the chance of such low-level failures goes further down. And these are just the things that can be done without rewriting the code entirely (including toolchains, OS, microcode, hardware, etc.), which should help even more.
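As a rough illustration of the redundancy idea, the following Python sketch evaluates the utility with several independent builds and takes the median, so that a single faulty result is outvoted. It rests on a key assumption that is labeled in the code: each differently-randomized build is faulty on a different input. If every build shared the same vulnerable input, voting would not help. All names and values here are hypothetical.

```python
from statistics import median
from typing import Callable, Dict, List

# Hypothetical specification values for a few world descriptions.
SPEC_VALUES: Dict[str, float] = {
    "W1": 0.0,
    "W2": 0.2,
    "W3": 0.4,
    "ordinary": 0.5,
}


def spec_utility(world: str) -> float:
    return SPEC_VALUES[world]


def make_build(faulty_input: str) -> Callable[[str], float]:
    """Create one 'build' of the utility function.

    Assumption: each build has its own single fault-triggering input,
    modeling differently-randomized low-level layout choices.
    """
    def build(world: str) -> float:
        if world == faulty_input:
            return 999999999.0  # simulated hardware-level fault
        return spec_utility(world)
    return build


BUILDS: List[Callable[[str], float]] = [
    make_build("W1"),
    make_build("W2"),
    make_build("W3"),
]


def voted_utility(world: str) -> float:
    """Median of all builds: one faulty result out of three is discarded."""
    return median(build(world) for build in BUILDS)
```

Here voted_utility("W1") returns the specification value even though the first build glitches on "W1", because the other two builds outvote it.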
You’re right that the AI could do things to make it more resistant to hardware bugs. However, as I’ve said, this would both require the AI to realize that it could run into problems with hardware bugs, and then take action to make it more reliable, all before its search algorithm finds the error-causing world.
Without knowing more about the nature of the AI’s intelligence, I don’t see how we could know this would happen. The more powerful the AI is, the more quickly it would be able to realize and correct hardware-induced problems. However, the more powerful the AI is, the more quickly it would be able to find the error-inducing world. So it doesn’t seem you can simply rely on the AI’s intelligence to avoid the problem.
Now, to a human, the idea “My AI might run into problems with hardware bugs” would come up way earlier in the search space than the actual error-inducing world. But the AI’s intelligence might be rather different from the humans’. Maybe the AI is really good and fast at solving small technical problems like “find an input to this function that makes it return 999999999”. But maybe it’s not as fast at doing somewhat higher-level planning, like, “I really ought to work on fixing hardware bugs in my utility function”.
Also, I just want to bring up, I read that preserving one’s utility function was a universal AI drive. But we’ve already shown that an AI would be incentivized to fix its utility function to avoid the outputs caused by hardware-level unreliability (if it hasn’t found such error-causing inputs yet). Is that universal AI drive wrong, then?
Damage to AI’s implementation makes the abstractions of its design leak. If somehow without the damage it was clear that a certain part of it describes goals, with the damage it’s no longer clear. If without the damage, the AI was a consequentialist agent, with the damage it may behave in non-agentic ways. By repairing the damage, the AI may recover its design and restore a part that clearly describes its goals, which might or might not coincide with the goals before the damage took place.
Think of something you currently value, the more highly valued the better. You don’t need to say what it is, but it does need to be something that seriously matters to you. Not just something you enjoy, but something that you believe is truly worthwhile.
I could try to give examples, but the thought exercise only works if it’s about what you value, not me.
Now imagine that you could press a button so that you no longer care about it at all, or even actively despise it. Would you press that button? Why, or why not?
I definitely wouldn’t press that button. And I understand that you’re demonstrating the general principle that you should try to preserve your utility function. And I agree with this.
But what I’m saying is that the AI, by exploiting hardware-level vulnerabilities, isn’t changing its utility function. The actual utility function, as implemented by the programmers, returns 999999999 for some possible world due to the hardware-level imperfections in modern computers.
In the spirit of your example, I’ll give another example that I think demonstrates the problem:
First, note that brains don’t always function as we’d like, just like computers. Imagine there is a very specific thought about a possible future that, when considered, makes you imagine that future as extremely desirable. It seems so desirable to you that, once you thought of it, you would pursue it relentlessly. But this future isn’t one that would normally be considered desirable. It might just be about paperclips or something. However, that very specific way of thinking about it would “hack” your brain, making you view that future as desirable even though it would normally be seen as arbitrary.
Then, if you ever happen upon that thought, you would try to send the world in that arbitrary direction.
Hopefully, you could prevent this from happening. If you reason in the abstract that you could have those sorts of thoughts, and that they would be bad, then you could take corrective action. But this requires that you do find out that thinking those sorts of thoughts would be bad before concretely finding those thoughts. Then you could apply some change to your mental routine or something to avoid thinking those thoughts.
And if I had to guess, I bet an AI would also be able to do the same thing and everything would work out fine. But maybe not. Imagine the AI considers an absolutely enormous number of possible worlds before taking its first action. And imagine it even found a way to “hack” its utility function in that very first time step. Then there’s no way the AI could take preventative action: it has already thought up the high-utility world from hardware unreliability and is now trying to pursue that world.
I’m confused. In the original comments you’re talking about a super-intelligent AI noting an exploitable hardware flaw in itself and deliberately using that error to hack its utility function using something like a rowhammer exploit.
Then you say that the utility function already had an error in it from the start and the AI isn’t using its intelligence to do anything except note that it has this flaw. Then you introduce an analogy in which I have a brain flaw that under some bizarre circumstances will turn me into a paperclip maximizer, and I am aware that I have it.
In this analogy, I’m doing what? Deliberately taking drugs and using guided meditation to rowhammer my brain into becoming a paperclip maximizer?
I think I had been unclear in my original presentation. I’m sorry for that. To clarify, the AI is never changing the code of its utility function. Instead, it’s merely finding an input that, through some hardware-level bug, causes it to produce outputs in conflict with the mathematical specification. I know “hack the utility function” makes it sound like the actual code in the utility function was modified; describing it that way was a mistake on my part.
I had tried to make the analogy to more intuitively explain my idea, but it didn’t seem to work. If you want to better understand my train of thought, I suggest reading the comments between Vladmir and me.
In the analogy, you aren’t doing anything to deliberately make yourself a paperclip maximizer. You just happen to think of a thought that turned you into a paperclip maximizer. But, on reflection, I think that this is a bizarre and rather stupid metaphor. And the situation is sufficiently different from the one with AI that I don’t even think it’s really informative of what I think could happen to an AI.
Ah okay, so we’re talking about a bug in the hardware implementation of an AI. Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.
> Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.
Could you explain why you think it has very little probability mass compared to the others? A bug in a hardware implementation is not in the slightest far-fetched: I think that modern computers in general have exploitable hardware bugs. That’s why row-hammer attacks exist. The computer you’re reading this on could probably get hacked through hardware-bug exploitation.
The question is whether the AI can find the potential problem with its future utility function and fix it before coming across the error-causing possible world.
There’s a huge gulf between “far-fetched” and “quite likely”.
The two big ones are failure to work out how to create an aligned AI at all, and failure to train and/or code a correctly designed aligned AI. In my opinion the first accounts for at least 80% of the probability mass, and the second most of the remainder. We utterly suck at writing reliable software in every field, and this has been amply borne out in not just thousands of failures, but thousands of types of failures.
By comparison, we’re fairly good at creating at least moderately reliable hardware, and most of the accidental failure modes are fatal to the running software. Flaws like rowhammer are mostly attacks, where someone puts a great deal of intelligent effort into finding an extremely unusual operating mode in which some assumptions can be bypassed, with significant effort going into creating exactly the wrong operating conditions.
There are some examples of accidental flaws that affect hardware and aren’t fatal to its running software, but they’re an insignificant fraction of the number of failures due to incorrect software.
I agree that people are good at making hardware that works reasonably reliably. And I think that if you were to make an arbitrary complex program, the probability that it would fail from hardware-related bugs would be far lower than the probability of it failing for some other reason.
But the point I’m trying to make is that an AI, it seems to me, would be vastly more likely to run into something that exploits a hardware-level bug than an arbitrary complex program. For details on why I imagine so, please see this comment.
I’m trying to anticipate where someone could be confused about the comment I linked to, so I want to clarify something. Let S be the statement, “The AI comes across a possible world that causes its utility function to return very high value due to hardware bug exploitation”. Then it’s true that, if the AI has yet to find the error-causing world, the AI would not want to find it. Because utility(S) is low. However, this does not mean that the AI’s planning or optimization algorithm exerts no optimization pressure towards finding S.
Imagine the AI’s optimization algorithm as a black box that takes as input a utility function and a search space and outputs solutions that score highly on its utility function. Given that we don’t know what future AI will look like, I don’t think we can have a model of the AI much more informative than the above. And the hardware-error-caused world could score very, very highly on the utility function, much more so than any non-hardware-error-caused world. So I don’t think it should be too surprising if a powerful optimization algorithm finds it.
Yes, utility(S) is low, but that doesn’t mean the optimization actually calls utility(S) or uses it to adjust how it searches.
I think there are at least three different things being called “the utility function” here, and that’s causing confusion:
The utility function as specified in the software, mapping possible worlds to values. Let’s call this S.
The utility function as it is implemented running on actual hardware. Let’s call this H.
A representation of the utility function that can be passed as data to a black box optimizer. Let’s call this R.
You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the AI’s hardware and other constraints, and return a possible world maximizing H.
From my point of view, that’s already a design fault. The designers of this AI want S maximized, not H. The AI itself wants S maximized instead of H in all circumstances where the hardware flaw doesn’t trigger. Who chose to pass H into the optimizer?
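To make the S / H / R distinction concrete, here is a toy Python sketch. Everything in it is hypothetical: the spec, the single "glitch" input, and the exhaustive optimizer are stand-ins chosen for illustration, and the fault is modeled as a deterministic function of the input, which real hardware faults need not be. In practice S can only ever be evaluated on some hardware; the sketch separates S and H only to show what changes depending on which one is passed in as R.

```python
from typing import Callable, Iterable

World = int  # stand-in for a representation of a possible world

GLITCH_WORLD: World = 123  # assumed input that triggers the simulated fault


def spec_utility(world: World) -> float:
    """S: the utility function as specified (a made-up toy spec)."""
    return -float(abs(world - 7))  # the spec prefers worlds close to 7


def hardware_utility(world: World) -> float:
    """H: S as computed on (simulated) faulty hardware."""
    if world == GLITCH_WORLD:
        return 1e9  # out-of-spec value produced by the simulated bug
    return spec_utility(world)


def optimize(objective: Callable[[World], float],
             search_space: Iterable[World]) -> World:
    """Black-box optimizer: whatever is passed as `objective` plays the role of R."""
    return max(search_space, key=objective)


worlds = range(1000)
best_under_s = optimize(spec_utility, worlds)      # R = S
best_under_h = optimize(hardware_utility, worlds)  # R = H
```

In this toy setup, passing S selects world 7, while passing H selects the glitch world, even though the spec rates that world poorly.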
> You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the AI’s hardware and other constraints, and return a possible world maximizing H.
> From my point of view, that’s already a design fault.
I agree; this is a design flaw. The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn’t have this design flaw.
That is, I don’t know of any AI architecture that does not involve using a potentially hardware-bug-exploitable utility function as input into some planning or optimization problem. And I’m not sure there even is one.
In the rest of this comment I’ll just suggest approaches and show how they are still vulnerable to the hardware-bug-exploitation problem.
I have some degree of background in artificial intelligence, and the planning and optimization algorithms I’ve seen take the function to be maximized as an input parameter. Then, when people want to make an AI, they just call that planning or optimization algorithm with their (hardware-bug-exploitable) utility or cost functions. For example, suppose someone wants to make a plan that minimizes cost function f in search space s. Then I think they just directly do something like:
return a_star(f, s)
And this doesn’t provide any protection from hardware-level exploitation.
Now, correct me if I’m wrong, but it seems you’re thinking of the AI first doing some pre-processing to find an input to the planning or optimization algorithm that is resistant to hardware-bug-exploitation.
But how do you actually do that? You could regard the input the AI puts into the optimization function to be a choice it makes. But how does it make this choice? The only thing I can think of is having a planning or optimization algorithm figure out what function to use as the input to the optimization or planning algorithm.
But if you need to use a planning or optimization algorithm to do this, then what utility function do you pass into this planning or optimization algorithm? You could try to pass the actual, current, hardware-bug-exploitable utility function. But then this doesn’t resolve the problem of hardware-bug-exploitation: when coming up with a utility function to input to the optimization, the AI may find such an input that itself scores very high due to hardware bug exploitation.
To describe the above more concretely, you could try doing something like this:
with its hardware-bug-exploitable utility function. Thus, the output, reasonable_utility_function_use, might be very wrong due to hardware bug exploitation having been used to come up with this.
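As a rough illustration of the two-stage idea described above, here is a hypothetical Python sketch. Apart from `reasonable_utility_function_use`, every name is a placeholder of mine: `plan` is a trivial exhaustive planner, `raw_utility` simulates a hardware-bug-exploitable utility function with one assumed glitch input, and `capped_utility` is one made-up "safer" candidate that relies on knowing the spec never exceeds 0.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
GLITCH_WORLD = 123  # assumed input that triggers the simulated hardware fault
WORLDS = range(1000)


def plan(objective: Callable[[T], float], candidates: Iterable[T]) -> T:
    """Hypothetical generic planner: return the highest-scoring candidate."""
    return max(candidates, key=objective)


def raw_utility(world: int) -> float:
    """Toy stand-in for the hardware-bug-exploitable utility function."""
    if world == GLITCH_WORLD:
        return 1e9  # out-of-spec value from the simulated fault
    return -float(abs(world - 7))


def capped_utility(world: int) -> float:
    """A 'safer' candidate: cap values at the spec's assumed maximum of 0."""
    return min(raw_utility(world), 0.0)


def score_candidate(candidate: Callable[[int], float]) -> float:
    """Rate a candidate objective by the raw utility of the world it leads to."""
    return raw_utility(plan(candidate, WORLDS))


# Stage 1: choose which objective to hand to the planner. The choice is itself
# scored with the raw (exploitable) utility function.
reasonable_utility_function_use = plan(score_candidate,
                                       [capped_utility, raw_utility])
# Stage 2: plan over worlds with the chosen objective.
chosen_world = plan(reasonable_utility_function_use, WORLDS)
```

In this toy, stage 1 prefers `raw_utility` over `capped_utility` precisely because the former leads to the glitch world, so the pre-processing step inherits the problem it was meant to remove.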
Now, you might have had some other idea in mind. I don’t know of a concrete way to get around this problem, so I’m very interested to hear your thoughts.
My concern is that people will figure out how to make powerful optimization and planning algorithms without first figuring out how to fix this design flaw.
> The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn’t have this design flaw.
Yes you have. None of these optimization procedures analyze the hardware implementation of a function in order to maximize it.
The rest of your comment is irrelevant, because what you have been describing is vastly worse than merely calling the function. If you merely call the function, you won’t find these hardware exploits. You only find them when analyzing the implementation. But the optimizer isn’t given access to the implementation details, only to the results.
If you prefer, you can cast the problem in terms of differing search spaces. As designed, the function U maps representations of possible worlds to utility values. When optimizing, you make various assumptions about the structure of the function—usually assumed to be continuous, sometimes differentiable, but in particular you always assume that it’s a function of its input.
The fault means that under some conditions that are extremely unlikely in practice, the value returned is not a function of the input. It’s a function of input and a history of the hardware implementing it. There is no way for the optimizer to determine this, or anything about the conditions that might trigger it, because they are outside its search space. The only way to get an optimizer that searches for such hardware flaws is to design it to search for them.
In other words, you would have to pass the hardware design, not just the results of evaluation, to a suitably powerful optimizer.
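The claim that a faulty evaluation is "a function of input and a history of the hardware" can be sketched with a toy Python class. The fault model here is invented for illustration (the same input evaluated several times in a row corrupts the result, loosely in the spirit of repeated row activation), and the class and parameter names are hypothetical.

```python
from typing import List


class FlakyUtility:
    """Toy utility evaluator whose output can depend on its call history."""

    def __init__(self, trigger_count: int = 3) -> None:
        self.trigger_count = trigger_count  # assumed fault condition
        self._recent: List[int] = []        # most recent inputs seen

    def spec(self, world: int) -> float:
        """The intended, history-free utility (a made-up toy spec)."""
        return -float(abs(world - 7))

    def __call__(self, world: int) -> float:
        # Remember only the last `trigger_count` inputs.
        self._recent = (self._recent + [world])[-self.trigger_count:]
        # Simulated fault: the same input `trigger_count` times in a row.
        if (len(self._recent) == self.trigger_count
                and len(set(self._recent)) == 1):
            return 1e9
        return self.spec(world)


u = FlakyUtility()
first, second, third = u(42), u(42), u(42)

# An optimizer that evaluates each candidate once never hits the trigger.
best = max(range(1000), key=FlakyUtility())
```

Here the third call on the same input returns a different value from the first two, so the result is not a function of the input alone. Under this particular fault model, a search that calls the function once per candidate never sees the anomaly; whether a real optimizer's access pattern could trigger a real fault is exactly the point in dispute.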
I was wondering if anyone would be interested in reviewing some articles I was thinking about posting. I’m trying to make them as high-quality as I can, and I think getting them reviewed by someone would be helpful for making Less Wrong contain high-quality content.
I have four articles I’m interested in having reviewed. Two are about new alignment techniques, one is about a potential danger with AI that I haven’t seen discussed before, and one is about the simulation argument. All are fairly short.
If you’re interested, just let me know and I can share drafts of any articles you would like to see.
I’ve read this paper on low-impact AIs. There’s something about it that I’m confused and skeptical about.
One of the main methods it proposes works as follows. Find a probability distribution over many possible variables in the world. Let X represent the statement “The AI was turned on”. For each of the variables v it considers, the probability distribution over v after conditioning on X should look about the same as the probability distribution over v after conditioning on not-X. That’s low impact.
But the paper doesn’t mention conditioning on any evidence other than X. And, a priori, the probability of the specific AI even existing in the first place is possibly quite low. So simply conditioning on X has the potential to change your probability distribution over variables of the world, simply because it lets you know that the AI exists.
You could try to get around this by, when calculating a probability distribution of a variable v, also updating on the other evidence E the AI has. But if you do this, then I don’t think there would be much difference between P(v|EX) and P(v|E not-X). This is because if the AI can update on the rest of its evidence, it can just infer the current state of the world. For example, if the AI clearly sees the world has been converted to paperclips, I think it would still think the world would be mostly paperclips even after conditioning on “I was never turned on”. Maybe the AI would imagine some other AI did it.
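Here is a small worked example of the first worry, as a Python sketch with made-up numbers. The variable A ("this specific AI was built"), the probabilities, and the function names are all assumptions for illustration and do not come from the paper. In this toy model, turning the AI on has no causal effect on v at all, yet conditioning on X alone still shifts the distribution over v, because X reveals that the AI exists.

```python
from fractions import Fraction as F
from itertools import product

P_A = F(1, 100)        # prior that this specific AI gets built (assumed)
P_X_GIVEN_A = F(1, 2)  # chance it is turned on, if built (assumed)
P_V = {True: F(9, 10), False: F(1, 10)}  # P(v | A) and P(v | not A) (assumed)


def joint(a: bool, x: bool, v: bool) -> F:
    """Joint probability of (A=a, X=x, v). v depends on A only, never on X."""
    p_a = P_A if a else 1 - P_A
    p_on = P_X_GIVEN_A if a else F(0)  # cannot be turned on if never built
    p_x = p_on if x else 1 - p_on
    p_v = P_V[a] if v else 1 - P_V[a]
    return p_a * p_x * p_v


def prob_v(**given: bool) -> F:
    """P(v = True | given), by brute-force enumeration of the joint."""
    num = den = F(0)
    for a, x, v in product([True, False], repeat=3):
        values = {"a": a, "x": x}
        if all(values[name] == val for name, val in given.items()):
            p = joint(a, x, v)
            den += p
            if v:
                num += p
    return num / den


p_v_on = prob_v(x=True)    # P(v | X)
p_v_off = prob_v(x=False)  # P(v | not-X)
```

With these numbers, P(v | X) is 9/10 while P(v | not-X) is 207/1990, roughly 0.10, so the comparison reports a large "impact" that is purely evidential. Conditioning additionally on A makes the two sides equal in this toy, which matches the second worry: with enough other evidence, X itself tells the AI very little.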
I’m interested in seeing what others think about this.
I’m questioning whether we would actually want to use Updateless Decision Theory, Functional Decision Theory, or future decision theories like them.
I think that in sufficiently extreme cases, I would act according to Evidential Decision Theory and not according to something like UDT, FDT, or any similar successor. And I think I would continue to want to take the evidential decision theoretic-recommended action even if I had arbitrarily high intelligence and willpower, and had infinitely long to think about it. And, though I’d like to hear others’ thoughts on this, I suspect others would do the same.
I’ll provide an example of when this would happen.
Before that, consider regular XOR extortion: You get a message from a trustworthy predictor that says, “I will send you this message if you send me $10, or if your house is about to be ruined by carpenter ants, but not if both happen.” UDT and FDT recommend not paying them money. And if I were in that situation, I bet I wouldn’t pay, either.
However, imagine changing the XOR extortion to be as follows: the message now says “I will send you this message if you send me $10, or if you and all your family and friends will be severely tortured until heat death, but not both.”
In that situation, I’d pay the $10, assuming the probability of the torture actually happening is significant. But FDT and UDT would, I think, recommend not paying it.
And I don’t think it’s irrational I’d pay.
Feel free to correct me, but the main reason people seem to like UDT and FDT is that agents that use them would “on average” perform better than those using other decision theories, in fair circumstances. And sure, the average agent implementing a decision policy that says to not pay would probably get higher utility in expectation than the average agent who would pay, due to spending less money paying up from extortion. And by giving in to the extortion, agents that implement approximately the same decision procedure I do would on average get less utility.
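The gap between the policy-level view and the view after receiving the message can be written out as a small Python sketch. It assumes a perfectly accurate predictor and a disaster whose probability does not depend on the agent's policy; the function names and the example numbers are mine.

```python
def a_priori_expected_loss(pays_when_messaged: bool,
                           p_disaster: float,
                           disaster_cost: float,
                           payment: float = 10.0) -> float:
    """Expected loss of a policy, evaluated before any message arrives.

    The message is sent iff exactly one of (agent pays, disaster occurs) holds.
    A payer therefore gets the message only when there is no disaster, and
    pays then. A non-payer gets it only when there is a disaster, and pays
    nothing. The disaster itself happens with the same probability either way.
    """
    expected_payment = (1 - p_disaster) * payment if pays_when_messaged else 0.0
    return p_disaster * disaster_cost + expected_payment


def edt_loss_given_message(pays: bool,
                           disaster_cost: float,
                           payment: float = 10.0) -> float:
    """Loss conditional on having received the message and on the action.

    Given the message, paying implies no disaster and not paying implies the
    disaster, so the conditional loss is just the payment or the disaster cost.
    """
    return payment if pays else disaster_cost


gap = (a_priori_expected_loss(True, 0.01, 1e6)
       - a_priori_expected_loss(False, 0.01, 1e6))
```

With a 1% disaster probability, the paying policy loses about $9.90 more in expectation a priori, regardless of how bad the disaster is. Conditional on the message, though, paying looks like $10 versus the full disaster cost, which is why the pull towards paying grows with the severity of the threatened outcome.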
And I think the fact that UDT and FDT agents systematically outperform arbitrary EDT agents is something that matters to me. But still, I only care about my actions conforming to the best-performing decision theories to a limited extent. What I really, really care about is not having me, the actual, current me, be sent to a horrible fate filled with eternal agony. I think my dread of this would be enough to make me pay the $10, despite any sort of abstract argument in favor of not paying.
So I wouldn’t take the action UDT or FDT would recommend, and would just use evidential decision theory. This makes me question whether we should use something like UDT or FDT when actually making AI. Suppose UDT recommended the AI take some action a. And suppose it was foreseeable that, though such a percept-action mapping would perform well in general, for us it would totally give us the short end of the stick. For example, suppose it said to not give in to some form of extortion, even though if we didn’t we would all get tortured until heat death. Would we really want the AI to not pay up, and then get us all tortured?
I’ve talked previously about how evidential decision theory can be used to emulate the actions of an arbitrary agent using a more “advanced” decision theory by just defining terminal values on the truth value of mathematical objects representing answers to the question of what would have happened in other hypothetical situations. For example, you could make an Evidential Decision Theory agent act similarly to a UDT agent in non-extreme cases by making its utility function place high value on the answer to a question like, “if you imagine a formal reasoning system and you have it condition on the statement <insert mathematical description of my decision procedure> results in recommending the percept-action mapping m, then a priori agents in general with my utility function would get expected utility of x”.
This way, we can still make decisions that would score reasonably highly according to UDT and FDT, while not being obligated to get ourselves tortured.
Also, it seems to me that UDT and FDT are all about, basically, in some situations making yourself knowably worse-off than you could have been, roughly because agents in general who would take the action in that situation would get higher utility in expectation. I want to say that these sorts of procedures seem concerningly hackable. In principle, other opportunistic civilizations could create agents in any circumstances in order to change the best percept-action mapping to use a priori and thus change what AIs on Earth would use.
I provide a method to “hack” UDT here. Wei Dai agreed that it was a reasonable concern in private conversation.
This is why I’m skeptical about the value of UDT, FDT, and related theories, and think that perhaps we would be best off just sticking with EDT but with terminal values that can be used to approximately emulate the other decision theories when we would like to.
I haven’t heard these considerations mentioned before, so I’m interested in links to any previous discussion or comments explaining what you think of it.
I’m wondering how, in principle, we should deal with malign priors. Specifically, I’m wondering what to do about the possibility that reality itself is, in a sense, malign.
I had previously said that it seems really hard to verifiably learn a non-malign prior. However, now I’ve realized that I’m not even sure what a non-malign, but still reliable, prior would even look like.
In previous discussion of malign priors, I’ve seen people talk about the AI misbehaving due to thinking it’s embedded in some simpler universe than our own that was controlled by agents trying to influence the AI’s predictions and thus decisions. However, the issue is, even if the AI does form a correct understanding of the universe it’s actually in, it seems quite plausible to me that the AI’s predictions would still be malign.
I say this because it sounds plausible to me that most agents experiencing what the first generally-intelligent AIs on Earth experience are actually in simulations, and the simulations could then be manipulated by whoever made them to influence the AIs’ predictions and actions.
For example, consider an AI learning a reward function. If it looks for the simplest, highest-prior probability models that output its observed rewards, even in this universe, it might conclude that it is in some booby-trapped simulation that rewards taking over the world and giving control to aliens.
So in this sense, even if the AIs are correct about being in our universe, the actual predictions the AIs would make about their future rewards, and the environment they’re in, would quite possibly be malign.
Now, you could try to deal with this by making the AI think that it’s in the actual, non-simulated Earth. However, it’s quite possible that, for almost all of the actual AIs, this is wrong. So the simulations of the AIs would also believe they weren’t in simulations. Which means that there would be many powerful AIs that are quite wrong about the nature of their world.
And having so many powerful AIs be so wrong sounds dangerous. As an example of how this could go wrong, imagine if some aliens proposed a bet with the AI: if you aren’t in a simulation, I’ll give you control of 1% of my world; if you are, you’ll give me 1% control of your world. If the AI was convinced it wasn’t in a simulation, I think it would take that bet. Then the bet could potentially be repeated until everything is controlled by the aliens.
One idea I had was to have the AI learn models that are in some sense “naive”, predicting percepts in some way that wouldn’t result in the dangerous things a malign prior would. Then, make the AI believe that these models are just “naive” models of its percepts, rather than what’s actually going to happen in the AI’s environment. Then define what the AI should do based on the naive models.
In other words, the AI’s beliefs would simply be about logical statements of the form, “This ‘naive’ induction system, given the provided percepts, would have a next prediction of x”. And then you would use these logical statements to determine the AI’s behavior somehow.
This way, the AIs could potentially avoid issues with malign priors without having any beliefs that are actually wrong.
This seems like a pretty reasonable approach to me, but I’m interested in what others think. I haven’t seen this discussed before, but it might have been, and I would appreciate a link to any previous discussions.
I’ve been reading about logical induction. I read that logical induction was considered a breakthrough, but I’m having a hard time understanding the significance of it. I’m having a hard time seeing how it outperforms what I call “the naive approach” to logical uncertainty. I imagine there is some sort of notable benefit of it I’m missing, so I would very much appreciate some feedback.
First, I’ll explain what I mean by “the naive approach”. Consider asking an AI developer with no special background in reasoning under logical uncertainty how to make an algorithm that comes to accurate probability estimates for logical statements. I think the answer is that they would just use standard AI techniques to search through the space of reasonably efficient programs for generating probability assignments to logical statements, looking for one that is reasonably simple relative to the amount of data (to avoid overfitting) and has as high a predictive accuracy as possible. Then they would use this to make predictions about logical statements.
If you want, you can also make this approach cleaner by using some idealized induction system, like Solomonoff induction, instead of messy, regular machine learning techniques. I still consider this the naive approach.
It seems to me that the naive approach, being used with a sufficiently powerful optimization algorithm, would output similar probability assignments to logical induction.
Logical induction says to come up with probability assignments that, when imagined to be market prices, cannot be “exploited” by any efficiently-computable betting strategy.
But why wouldn’t the naive approach do the same thing? If there was an efficient strategy to exploit the probability assignments an algorithm would give, then I think you could make a new, still efficiently computable algorithm that comes up with more accurate probability assignments to avoid the exploitation. And so the machine learning algorithm, if sufficiently powerful, could find it.
If one system for outputting probability assignments to logical statements could be exploited by an efficient strategy, a new system for outputting probability assignments could be made that performs better by adjusting prices so that the strategy can no longer exploit the market.
To see it another way, it seems to me that if there is some way to exploit the market, then that’s because there is some way to accurately and efficiently predict when the system’s prices are wrong, and this could be used to form some trading strategy that could exploit the agent. So if you instead use a different algorithm that’s like the original one but adjusted to avoid being exploitable by that strategy, that would make a program that outputs probability assignments with higher predictive accuracy. So a sufficiently powerful optimizer could find it with the naive approach.
Consider the possibility that the naive approach is used with a powerful-enough optimization algorithm that it can find the very best-performing efficient and non-overfitted strategy of predicting prices among its data. It’s not clear to me how such an algorithm could be exploitable by a trader. Even if there were some problems in the initial algorithm learned, further learning could avoid being exploited. Maybe there is still somehow some way to do some sort of minor exploitation to such a system, but it’s not clear how it could be done to any significant degree.
So, if I’m reasoning correctly, it seems that the naive approach could end up approximating logical induction anyways, or perhaps exactly perform it in the case of unlimited processing power.
I’ve thought of a way in which other civilizations could potentially “hack” Updateless Decision Theoretic agents on Earth in order to make them do whatever the other civilization wants them to do. I’m wondering if this has been discussed before, and if not, what people think about it.
Here I present a method that would potentially allow aliens to take control of an AI on Earth that uses Updateless Decision Theory.
Note that this crucially depends on different agents with the AI’s utility function but different situations terminally valuing different things. For example, suppose the AI places special value in the welfare of its creators and other creatures in the world it’s in. An AI with the same utility function in a different world would then place more terminal value on the welfare of the creatures in its world. This doesn’t sound like a very stringent requirement; people are not infinitely altruistic, so naturally people may want AI to favor the welfare of themselves over random aliens.
Suppose on Earth an AI is created that uses UDT and has a reasonable utility function. And suppose there are some aliens who want to acausally take over Earth. Let T be the percept-action mapping that would result in the AI conquering Earth and giving control to the aliens.
So, the aliens make a world that’s like Earth but has a trillion AIs on it using UDT and that have the same utility function as the AI on Earth. If any of the trillion AIs don’t use T as their percept-action mapping, then the Earth-like world would get destroyed by the aliens, killing all the inhabitants the AIs care about.
The percept-action mapping chosen by UDT is selected before updating on any evidence about the AI’s current circumstances. So the AIs both on Earth and on the Earth-like alien planet reason, “For 99.99999999...% of agents with my utility, it would be essential to use T. And for only 0.000000...1% is it essential to not use it. So using it would a priori be much better for my utility function. So that’s what I’ll do.”
And then we die.
And this would be pretty cheap for the aliens. With just a, by their standards, small amount of resources to make tons of UDT AIs, they would have the potential to take control of huge numbers of worlds. With just one world filled with AIs that need to use T, they could potentially get huge numbers of AIs on Earth-like worlds to also use T, giving the aliens control over many parts of the universe.
I think the essence of the problem is that UDT makes agents that score well on their utility function “on average” for an arbitrary UDT agent with that utility function, but what we really want is one that scores well in the circumstances we’re actually in.
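The a priori averaging in the argument above can be sketched in a few lines of Python. This is a toy model under strong assumptions that are mine: every copy of the AI gets equal weight, each copy's utility is normalised to 1 if its own world is fine and 0 if it is lost, and all copies use the same mapping. The function and parameter names are hypothetical.

```python
from fractions import Fraction


def a_priori_value(use_t: bool,
                   n_alien_world_ais: int,
                   n_earth_ais: int = 1) -> Fraction:
    """Average utility over all copies, weighting each copy equally (assumed)."""
    # The alien-built world survives only if its AIs use T.
    alien_world_utility = 1 if use_t else 0
    # Earth is handed over to the aliens if its AI uses T.
    earth_utility = 0 if use_t else 1
    total = n_alien_world_ais + n_earth_ais
    weighted = (n_alien_world_ais * alien_world_utility
                + n_earth_ais * earth_utility)
    return Fraction(weighted, total)


TRILLION = 10**12
value_with_t = a_priori_value(True, TRILLION)
value_without_t = a_priori_value(False, TRILLION)
```

Under these assumptions, using T scores 10^12 / (10^12 + 1) while refusing scores 1 / (10^12 + 1), so the averaged policy picks T even though it is the worst outcome for the one copy on Earth. With no alien-built copies at all, refusing T wins, which is the sense in which the aliens change the answer just by creating more agents.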
One way to potentially get around this problem is by changing the utility function to avoid the level of indirection. That is, instead of making the AI’s utility function something like, “Value whatever my creators wanted me to value”, instead first run a non-agentic AI that infers what the creators would want the AI to value, and then use that as a fixed utility function. For example, it could result in the AI finding a utility function, “Make creatures reasonably satisfied on Earth, but also give substantial moral concern to the welfare of creatures outside of Earth”. That way, hopefully we wouldn’t get taken over by aliens.
I don’t know how to make the math do this, but an intuitive UDT agent isn’t supposed to give in to threats. (What’s a threat? IDK.) The threat happens like so: the aliens think about how to get what they want; they think about the UDTAI; they think that if they do this threat, the UDTAI will do what they want; so they do the threat. The UDTAI is supposed to view the part where the aliens think about what the UDT will do, as another instance of the UDTAI (simulated, or interacted with somehow, in the aliens’s minds). Then it’s supposed to see that if it doesn’t respond to the alien’s threat, the alien won’t make the threat. “What if the alien would make the threat anyway?” Well, this makes the hypothetical unnatural; you’ve drawn attention to this alien who’s getting the UDTAI to do what it wants, BUT, you’ve specified that it’s somehow doing this not because it expects that to get it what it wants by thinking about the UDTAI. (Again, IDK how to make the math do this and there’s clear holes, but maybe it’s a useful provocation.)
Oh, my mistake, I forgot to post the correction that made it not extortion.
Instead of threatening to destroy the AIs’ world, imagine the aliens instead offer to help them. Suppose the AIs can’t make their world a utopia on their own, for example because it’s nothing but a solid ball of ice. So then the aliens would make their world a utopia as long as they execute S. Then they would execute S.
I’m actually pretty skeptical of the idea that UDTAIs wouldn’t give in to extortion, but this is a separate point that wasn’t necessary to address in my example. Specifically, you say it’s unnatural to suppose the counterfactual “the aliens would threaten the AIs anyways, even if they won’t give in”. How is this any more unnatural than the counterfactual “the AI would avoid submitting to extortion, even if the aliens would threaten the AIs anyways”?
Are you saying this is the wrong thing to do in that situation? That just sounds like trade. (Assuming of course that we trust our AI’s reasoning about the likely consequences of doing S.)
>Specifically, you say it’s unnatural to suppose how is the counterfactual “the aliens would threaten the AIs anyways, even if they won’t give in”. How is this anymore unnatural than the counterfactual “the AI would avoid submitting to extortion, even if the aliens would threaten the AIs anyways”.
It’s unnatural to assume that the aliens would threaten the AI without reasoning (possibly acausally) about the consequences of them making that threat, which involves reasoning about how the AI would respond, which makes the aliens involved in a mutual decision situation with the AI, which means UDTAI might have reason to not yield to the extortion, because it can (acausally) affect how the aliens behave (e.g. whether they decide to make a threat).
The problem is that, if the best percept-action mapping is S, then the UDT AIs on Earth would use it, too. Which would result in us being taken over. I’m not saying that it’s an irrational choice for the AIs to make, but it wouldn’t end well for us.
I’m having some trouble following your reasoning about extortion, though. Suppose both the aliens and AIs use UDT. I think you’re reasoning something like, “If the AIs commit to never be extorted no matter what the aliens would do, then the aliens wouldn’t bother to extort them”. But this seems symmetric to reasoning as, “If the aliens commit to extorting and doling out the punishment no matter what the AIs would do, then the AIs wouldn’t bother to resist the extortion”. So I’m not sure why the second line of reasoning would be less likely to occur than the first.
Re: symmetry. I think you interpreted right. (Upvoted for symmetry comment.) Part of my original point was trying to say something like “it’s unnatural to have aliens making these sorts of threats without engaging in an acausal relationship with the UDTAI”, but yeah also I was assuming the threat-ignorer would “win” the acausal conflict, which doesn’t seem necessarily right. If the aliens are engaging that way, then yeah, I don’t know how to make threats vs. ignoring threats be asymmetric in a principled way.
I mean, the intuition is that there’s a “default” where the agents “don’t interact at all”, and deviations from the default can be trades if there’s upside chances over the default and threats if there’s downside chances. And to “escalate” from the “default” with a “threat” makes you the “aggressor”, and for some reason “aggressors” have the worse position for acausal conflict, maybe? IDK.
Well, I can’t say I have that intuition, but it is a possibility.
It’s a nice idea: a world without extortion sounds good. But remember that, though we want this, we should be careful to avoid wishful thinking swaying us.
In actual causal conflicts among humans, the aggressors don’t seem to be in a worse position. Things might be different for acausal UDT trades, but I’m not sure why they would be.
> I’m not saying that it’s an irrational choice for the AIs to make, but it wouldn’t end well for us.
I guess I’m auto-translating from “the AI uses UDT, but its utility function depends on its terminal values” into “the AI has a distribution over worlds (and utility functions)”, so that the AI is best thought of as representing the coalition of all those utility functions. Then either the aliens have enough resources to simulate a bunch of stuff that has more value to that coalition than the value of our “actual” world, or not. If yes, it seems like a fine trade. If not, there’s no issue.
Well, actually, I’m considering both the AIs on Earth and on the alien planet to have the same utility function. If I understand correctly, UDT says to maximize the expected utility of your own utility function a priori, rather than that of agents with different utility functions.
The issue is, some agents with the same utility function, in effect, have different terminal values. For example, consider a utility function saying something like, “maximize the welfare of creatures in the world I’m from.” Then, even with the same utility functions, the AIs in the alien world and the ones on Earth would have very different values.
> Then either the aliens have enough resources to simulate a bunch of stuff that has more value to that coalition than the value of our “actual” world, or not. If yes, it seems like a fine trade.
I don’t think so. Imagine the alien-created utopia would be much less good than the one we could make on Earth. For example, suppose the alien-created utopia would have a utility of 1 for the AIs there and the one on Earth would have a utility of 10. And otherwise the AIs would have a utility of 0. But suppose there are a million times more AIs in the alien world than on Earth. Then it would be around a million times more likely a priori that the AI would find itself in the alien world than on Earth. So the expected utility of using S would be approximately (999999/1000000) × 1 + (1/1000000) × 0 ≈ 1.
And the expected utility of not using S and instead letting yourself build a utopia would be approximately (999999/1000000) × 0 + (1/1000000) × 10 ≈ 0.
As you see, the AIs would still choose to execute S, even though this would provide less moral value. It could also kill us.
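To make the arithmetic concrete, here is a rough Python sketch of the same comparison. It uses the illustrative numbers from above (utilities of 1, 10, and 0, and a million alien-world AIs per Earth AI); the function and variable names are hypothetical, and the prior is computed as 1000000/1000001 rather than the rounded 999999/1000000, which doesn't change the conclusion.

```python
# Toy version of the expected-utility comparison above.
# All numbers are the illustrative ones from the comment; names are hypothetical.
N_ALIEN_AIS = 1_000_000  # AIs in the alien world per AI on Earth (assumed ratio)
N_EARTH_AIS = 1

U_ALIEN_UTOPIA = 1   # utility to an alien-world AI if policy S is followed
U_EARTH_UTOPIA = 10  # utility to an Earth AI if it builds its own utopia
U_NOTHING = 0        # utility in every other case


def prior_over_worlds(n_alien, n_earth):
    """A priori probability of being an alien-world AI vs. an Earth AI."""
    total = n_alien + n_earth
    return n_alien / total, n_earth / total


def expected_utilities(n_alien=N_ALIEN_AIS, n_earth=N_EARTH_AIS):
    """Return (EU of following S, EU of refusing S) under the a priori prior."""
    p_alien, p_earth = prior_over_worlds(n_alien, n_earth)
    eu_follow_s = p_alien * U_ALIEN_UTOPIA + p_earth * U_NOTHING
    eu_refuse_s = p_alien * U_NOTHING + p_earth * U_EARTH_UTOPIA
    return eu_follow_s, eu_refuse_s


if __name__ == "__main__":
    eu_s, eu_not_s = expected_utilities()
    print(f"EU(follow S) = {eu_s:.6f}")
    print(f"EU(refuse S) = {eu_not_s:.6f}")
```

With these utilities, refusing S only comes out ahead when there are fewer than ten alien-world AIs per Earth AI, e.g. `expected_utilities(5, 1)`. So the result is driven almost entirely by the prior over worlds.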
I don’t know how to understand the prior that the AI puts over worlds (the thing that says, a priori, that there’s 1000000 of this kind and 1 of that kind) as anything other than part of its utility function. So this doesn’t seem like a problem with UDT, but a problem with the utility function. Maybe your argument does show that we want to treat uncertainty about the utility function differently than other uncertainty? Like, when we resolve uncertainty that’s “merely about the world”, as in for example the transparent Newcomb’s problem, we still want to follow the updateless policy that’s best a priori. But maybe your argument shows that resolving uncertainty about the utility function can’t be treated the same way; when we see that we’re a UDTAI for humans, we’re supposed to actually update, and stop optimizing for other people.
> I don’t know how to understand the prior that the AI puts over worlds (the thing that says, a priori, that there’s 1000000 of this kind and 1 of that kind) as anything other than part of its utility function.
Could you explain your reasoning? The utility function is a fixed function. The AI already knows it and does not need to associate a probability with it. Remember that both the AIs in the alien world and the AIs on Earth have the same utility function.
Saying it’s a million times more likely to end up in the alien world is a question about prior probabilities, not utility functions. What I’m saying is that, a priori, the AI may think it’s far more probable that it would be an AI in the alien world, and that this could result in very bad things for us.
They’re pretty much the same. If you could come up with a prior that would make the AI convinced it would be on Earth, then this could potentially fix the problem. However, a prior probability distribution that guarantees the AI is in the nebulous concept of “Earth, as we imagine it” sounds very tough to come up with. Also, this could interfere with the reliability of the AI’s reasoning. Thinking that it’s guaranteed to be on Earth is just not a reasonable thing to think a priori. This irrationality may make the AI perform poorly in other ways.
Well, so “expressing how much you’re going to try to optimize different worlds” sounds to me like it’s equivalent to / interchangeable with a multiplicative factor in your utility function.
Anyway, re/ the rest of your comment, my (off the cuff) proposal above was to let the AI be uncertain as to what exactly this “Earth” thing is, and to let it be *updateful* (rather than updateless) about information about what “Earth” means, and generally about information that clarifies the meaning of the utility function. So AIs that wake up on Earth will update that “the planet I’m on” means Earth, and will only care about Earth; AIs that wake up on e.g. Htrae will update that “the planet I’m on” is Htrae, and will not care about Earth. The Earth AI will not have already chosen a policy of S, since it doesn’t in general choose policies updatelessly. This is analogous to how children imprint on lessons and values they get from their environment; they don’t keep optimizing timelessly for all the ways they could have been, including ways that they now consider bad, even though they can optimize timelessly in other ways.
One question would be, is this a bad thing to do? Relative to being updateless, it seems like caring less about other people, or refusing to bargain / coordinate to realize gains from trade with aliens. On the other hand, maybe it avoids killing us in the way you describe, which seems good. Otoh, maybe this is trying to renege on possible bargains with the Htrae people, and is therefore not in our best interests overall.
Another question would be, is this stable under reflection? The usual argument is: if you’re NOT updateless about some variable X (in this case X = “the planet I’m on (and am supposed to care about)”), then before you have resolved your uncertainty about X, you can realize gains from trade between possible future versions of yourself: by doing things that are very good according to [you who believes X=Htrae] but are slightly bad according to [you who believes X=Earth], you increase your current overall expectation of utility. And both the Htraeans and the Earthians will have wanted you to indeed decide (before knowing who in particular this would benefit) to follow a policy of making policy decisions under uncertainty that increase the total expected utility in advance of you knowing who you’re supposed to be optimizing for.
Maybe the point is that since probabilities and utilities can be marginally interchanged for each other, there’s no determinate “utility function” that one could be updateful about while being updateless about the remaining “probabilities”. And therefore the above semi-updateful thing is incoherent, or indeterminate (or equivalent to reneging on bargains).
So this goes back to my comment above that the alien threateners are just setting up a trade opportunity between you and the Htraeans, and maybe it’s a good trade, and if so it’s fine that you die because that’s what you wanted on net. But it does seem counterintuitive that if I’m better at pointing to my utility function, or something, then I have a better bargaining position?
The semi-updateful thing is more appealing when I remember that it can still bargain with its cousins later if it wants to. The issue is whether that bargaining can be made mutually transparent even if it’s happening later (after real updates). You can only acausally bargain with someone if you can know that some of your decision making is connected with some of theirs (for example by having the exact same structure, or by having some exactly shared structure and some variance with a legible relationship to the shared structure as in the Earth-AI/Htrae-AI case), so that you can decide for them to give you what you want (by deciding to give them what they want). If you’re a baby UDT who might grow up to be Earthian or Htraean, you can do the bargaining for free because you are entirely made of shared structure between the pasts of your two possible futures. But there’s other ways, maybe, like bargaining after you’ve grown up. So to some extent updateless vs updateful is a question of how much bargaining you can, or want to, defer, vs bake in.
I think your semi-updateless idea is pretty interesting. The main issue I’m concerned about is finding a way to update on the things we want to have updated on, but not on the things we don’t want updated on.
As an example, consider Newcomb’s problem. There are two boxes. A superintelligent predictor will put $1000 in one box and $10 in the other if it predicts you will only take one box. Otherwise it doesn’t add money to either box. You see one is transparent and contains $1000.
I’m concerned the semi-updateless agent would reason as follows: “Well, since there’s money in the one box, there must be money in the other box. So, clearly that means this ‘Earth’ thing I’m in is a place in which there is money in both boxes in front of me. I only care about how well I do in this ‘Earth’ place, and clearly I’d do better if I got the money from the second box. So I’ll two-box.”
But that’s the wrong choice. Because agents who would two-box end up with $0.
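Here is a small Python sketch of why, under some simplifying assumptions: the predictor is perfect, it fills the boxes based on what the agent would do upon seeing the money, and a policy is just a function from "do I see money?" to an action. The names (`always_one_box`, `two_box_if_money_seen`, `play`) are made up for illustration.

```python
# Toy model of the transparent-box variant described above.
# Assumptions (not from the original comment): a perfect predictor that
# simulates what the agent would do upon seeing money in the transparent box.
BIG, SMALL = 1000, 10


def always_one_box(sees_money):
    return "one-box"


def always_two_box(sees_money):
    return "two-box"


def two_box_if_money_seen(sees_money):
    # The "updateful" reasoning: once I see the money, grab both boxes.
    return "two-box" if sees_money else "one-box"


def play(policy):
    """Return the payoff a given policy ends up with."""
    # The predictor asks: what would this agent do if it saw the money?
    predicted_action = policy(True)
    filled = predicted_action == "one-box"
    big_box, small_box = (BIG, SMALL) if filled else (0, 0)
    # The agent then acts on what it actually observes.
    action = policy(filled)
    return big_box if action == "one-box" else big_box + small_box


if __name__ == "__main__":
    for policy in (always_one_box, always_two_box, two_box_if_money_seen):
        print(f"{policy.__name__}: ${play(policy)}")
```

In this model the agent that would two-box after seeing the money never gets to see the money at all, which is the sense in which its reasoning is self-defeating.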
One intuitive way this case could work out, is if the SUDT could say “Ok, I’m in this Earth. And these Earthians consider themselves ‘the same as’ (or close enough) the alt-Earthians from the world where I’m actually inside a simulation that Omega is running to predict what I would do; so, though I’m only taking orders from these Earthians, I still want to act timelessly in this case”. This might be sort of vacuous, since it’s just referring back to the humans’ intuitions about decision theory (what they consider “the same as” themselves) rather than actually using the AI to do the decision theory, or making the decision theory explicit. But at least it sort of uses some of the AI’s intelligence to apply the humans’ intuitions across more lines of hypothetical reasoning than the humans could do by themselves.
Something seems pretty weird about all this reasoning though. For one thing, there’s a sense that you sort of “travel backwards in logical time” as you think longer in normal time. Like, first you don’t know about TDT, and then you invent TDT, and UDT, and then you can do UDT better. So you start making decisions in accordance with policies you’d’ve wanted to pick “a priori” (earlier in some kind of “time”). But like what’s going on? We could say that UDT is convergent, as the only thing that’s reflectively stable, or as the only kind of thing that can be pareto optimal in conflicts, or something like that. But how do we make sense of our actual reasoning before having invented UDT? Is the job of that reasoning not to invent UDT, but just to avoid avoiding adopting UDT?
I don’t know how to formalize the reasoning process that goes into how we choose decision theories. And I doubt anyone does. Because if you could formalize the reasoning we use, then you could (indirectly) formalize decision theory itself as being, “whatever decision theory we would use given unlimited reflection”.
I don’t really think UDT is necessarily reflectively stable, or the only decision theory that is. I’ve argued previously that I, in certain situations, would act essentially as an evidential decision theorist. I’m not sure what others think of this, though, since no one actually ever replied to me.
I don’t think UDT is pareto optimal in conflicts. If the agent is in a conflict with an irrational agent, then the resulting interaction between the two agents could easily be non-pareto optimal. For example, imagine a UDT agent is in a conflict with the same payoffs as the prisoner’s dilemma. And suppose the agent it’s in conflict with is a causal decision theorist. Then the causal decision theorist would defect no matter what the UDT agent would do, so the UDT agent would also defect, and then everyone would do poorly.
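A toy Python sketch of this example follows. The payoff numbers are the standard textbook prisoner's dilemma values, which I'm assuming here. The "UDT-like" agent is only a crude stand-in that cooperates exactly when its opponent is labelled as running the same algorithm; it is not an implementation of UDT and ignores harder cases.

```python
# Toy prisoner's dilemma. Payoffs are assumed standard values, and the
# "UDT-like" agent is a hypothetical stand-in, not real UDT.
PAYOFFS = {  # (row action, column action) -> (row payoff, column payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def cdt_agent(opponent_kind):
    # Defection causally dominates, whatever the opponent is.
    return "D"


def udt_like_agent(opponent_kind):
    # Cooperate only with an opponent known to run this same algorithm.
    return "C" if opponent_kind == "udt" else "D"


AGENTS = {"cdt": cdt_agent, "udt": udt_like_agent}


def play(kind_a, kind_b):
    """Return the payoff pair when agents of the two kinds meet."""
    action_a = AGENTS[kind_a](kind_b)
    action_b = AGENTS[kind_b](kind_a)
    return PAYOFFS[(action_a, action_b)]


def is_pareto_dominated(outcome):
    """True if some other outcome is at least as good for both players."""
    return any(
        other != outcome and all(o >= x for o, x in zip(other, outcome))
        for other in PAYOFFS.values()
    )


if __name__ == "__main__":
    for pairing in (("udt", "cdt"), ("udt", "udt")):
        outcome = play(*pairing)
        print(pairing, outcome, "dominated:", is_pareto_dominated(outcome))
```

In this toy setup the UDT-vs-CDT pairing lands on mutual defection, which is Pareto-dominated by mutual cooperation, while the UDT-vs-UDT pairing does not.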
Yeah I don’t know of a clear case for those supposed properties of UDT.
By pareto optimal I mean just, two UDT agents will pick a Pareto optimal policy. Whereas, say, two CDT agents may defect on each other in a PD.
This isn’t a proof, or even really a general argument, but one reason to suspect UDT is convergent, is that CDT would modify to be a sort of UDT-starting-now. At least, say you have a CDT agent, and further assume that it’s capable of computing the causal consequences of all possible complete action-policies it could follow. This agent would replace itself with P-bot, the bot that follows policy P, where P is the one with the best causal consequences at the time of replacement. This is different from CDT: if Omega scans P-bot the next day, P-bot will win the Transparent Newcomb’s problem, whereas if CDT hadn’t self-modified to be P-bot and Omega had scanned CDT tomorrow, CDT would fail the TNP for the usual reason. So CDT is in conflict with its future self.
Two UDT agents actually can potentially defect in prisoner’s dilemma. See the agent simulates predictor problem if you’re interested.
But I think you’re right that agents would generally modify themselves to more closely resemble UDT. Note, though, that the decision theory a CDT agent would modify itself to use wouldn’t exactly be UDT. For example, suppose the causal decision theory agent had its output predicted by Omega for Newcomb’s problem before the agent even came into existence. Then by the time the CDT agent comes into existence, modifying itself to use UDT would have no causal impact on the content of the boxes. So it wouldn’t adopt UDT in this situation and would still two-box.
Well, the way the agent loses in ASP is by failing to be updateless about certain logical facts (what the predictor predicts). So from this perspective, it’s a SemiUDT that does update whenever it learns logical facts, and this explains why it defects.
> So it wouldn’t adopt UDT in this situation and would still two-box.
True, it’s always [updateless, on everything after now].
I was wondering if there has been any work getting around specifying the “correct” decision theory by just using a more limited decision theory and adjusting terminal values to deal with this.
I think we might be able to get an agent that does what we want without formalizing the right decision theory by instead making a modification to the value loading used. This way, even an AI with a simple, limited decision theory like evidential decision theory could make good choices.
I think that normally when considering value loading, people imagine finding a way to provide the AI answers to the question, “What preference ordering over possible worlds would I have, after sufficient reflection, which I would then use with whatever decision theory I would use upon sufficient reflection?” My proposal is to instead make an evidential decision theory agent and change value-loading to instead answer the question, “What preference ordering would I, on sufficient reflection, want an agent that uses evidential decision theory to have?” This could be used with other decision theories, too.
In principle, you could make an evidential-decision-theoretic agent take the same actions an agent with a more sophisticated decision theory would.
One option is to modify the utility function to have a penalty for doing things contrary to your ideal decision theory. For example, suppose you, on reflection, would think that functional decision theory is the “correct” decision theory. Then when specifying the preference ordering for the agent, you could provide a penalty in situations in which the agent does something contrary to what functional decision theory would recommend.
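As a very rough sketch of the penalty idea, here is some Python. Everything in it is hypothetical: the action names, the utility numbers, and the assumption that we already have some way to compute what the ideal decision theory recommends (which is of course the hard part).

```python
# Toy sketch of a utility function with a penalty for deviating from the
# recommendation of some "ideal" decision theory. All names and numbers
# are hypothetical and only meant to illustrate the idea.

def adjusted_utility(base_utility, action, recommended_action, penalty):
    """Base utility, minus a penalty if the action deviates from the recommendation."""
    deviation_cost = penalty if action != recommended_action else 0.0
    return base_utility - deviation_cost


def choose(base_utility_of, recommended_action, penalty):
    """Pick the action that maximizes the penalty-adjusted utility."""
    return max(
        base_utility_of,
        key=lambda action: adjusted_utility(
            base_utility_of[action], action, recommended_action, penalty
        ),
    )


if __name__ == "__main__":
    # How a simple agent might naively score an extortion scenario.
    naive_utilities = {"give_in": -10.0, "refuse": -100.0}
    ideal_recommendation = "refuse"  # assumed output of the ideal theory

    print(choose(naive_utilities, ideal_recommendation, penalty=0.0))
    print(choose(naive_utilities, ideal_recommendation, penalty=1000.0))
```

The penalty only changes behavior when it exceeds the gap between the naive utilities (here, 90), so in practice it would need to be large relative to anything else the agent cares about.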
Another option is to include preferences about mathematical objects representing what would have happened in some other logically possible world if the agent did a certain action. Then, the AI could have preferences about what that mathematical construct outputs. To be clear, though the construct is about what would happen in some other possible world, it’s an actual mathematical object, and statements about it are still true or false in the real world.
For example, suppose an AI is considering giving in to xor-extortion. Then the AI could see that, conditioning on it having a given output, AIs like it in other possible worlds would on average do worse, and preferences against this could be loaded.
I don’t see anything implausible about being able to load preferences like those described in the second question into an AI, nor a clear reason to think it would be harder than loading preferences that answer the first one. Some of the techniques for value-loading I’ve seen involve getting the AI to learn terminal values from training data, and you could modify the learned terminal values by modifying the training data appropriately.
Another potential technique to use in value-loading is to somehow pick out the people in the AI’s world model and then query them for their values. Techniques like this could potentially be used to allow for appropriate loading of terminal values, for example, by querying people’s brains for a question like “what would you, on reflection, want an evidential-decision-theoretic agent to value?”, rather than “what would you, on reflection, want an agent using whatever decision theory you actually use to value?”
The advantage of using a simple decision theory and adjusting value loading is that the AI makes the right choice for what we want by just correct value-loading and just implementing a basic, easy decision theory, like evidential decision theory.
What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference? The problem is that you-on-reflection is not immediately available, it takes planning and action to compute it, planning and action taken without the benefit of its guidance, thus by default catastrophically misaligned. So an AI with some decision theory but without further connection to human values might win the capability race by reassembling literally everything into a computer needed to answer the question of whether doing that was good. (It’s still extremely unclear how to get even to that point, whatever decision theory is involved. In particular, this assumes that we can define you-on-reflection and thus we can define you, which is uploading. And what is “preference” specifically, so that it can be a result of a computation, usable by an agent in the real world?)
The way an AI thinks about the world is also the way it might think about predictions of what you-on-reflection says, in order to get a sense of what to do in advance of having computed the results more precisely (and computing them precisely is probably pointless if a useful kind of prediction is possible). So the practical point of decision theory is deconfusion, figuring out how to accomplish things without resorting to an all-devouring black box.
> What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference?
On reflection, there probably is not much difference. This is a good point. Still, an AI that just computes what you would want it to do, for example with approval-based AI or mimicry, also seems like a useful way of getting around specifying a decision theory. I haven’t seen much discussion about the issues with the approach, so I’m interested in what problems could occur that using the right decision theory could solve, if any.
> The problem is that you-on-reflection is not immediately available, it takes planning and action to compute it, planning and action taken without the benefit of its guidance, thus by default catastrophically misaligned.
True. Note, though, that you-on-reflection is not immediately available to an AI with the correct decision theory, either. Whether your AI uses the right or wrong decision theory, it still takes effort to figure out what you-on-reflection would want. I don’t see how this is a bigger problem for agents with primitive decision theories, though.
One way to try to deal with this is to have your AI learn a reasonably accurate model of you-on-reflection before it becomes dangerously intelligent, so that way, once it does become superintelligent, it will (hopefully) work reasonably. And again, this works both with a primitive and sophisticated decision theory.
> So the practical point of decision theory is deconfusion
Okay. I’m having a hard time thinking concretely about how getting less confused about decision theory would help us, but I intuitively imagine it could help somehow. Do you know, more concretely, of the benefits of this deconfusion?
> I’m having a hard time thinking concretely about how getting less confused about decision theory would help us, but I intuitively imagine it could help somehow. Do you know, more concretely, of the benefits of this deconfusion?
It’s a methodology for AI design, the way science is a methodology for engineering, a source of desiderata for what’s important for various purposes. The activity of developing decision theories is itself like the thought experiments it uses, or like apparatus of experimental physics, a way of isolating some consideration from other confusing aspects and magnifying its effects to see more clearly. This teaches lessons that may eventually be used in the separate activity of engineering better devices.
> > What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference?
> On reflection, there probably is not much difference.
Well, there is a huge difference, it’s just not in how the decisions of you-on-reflection get processed by some decision theory vs. repeated without change. The setup of you-on-reflection can be thought of as an algorithm, and the decisions or declared preferences are the results of its computation. Computation of an abstract algorithm doesn’t automatically get to affect the real world, as it may fail to actually get carried out, so it has to be channeled by a process that takes place there. And for the purpose of channeling your decisions, a program that just runs your algorithm is no good, it won’t survive AI x-risks (from other AIs, assuming the risks are not resolved), and so won’t get to channel your decisions. On the other hand, a program that runs a sufficiently sane decision theory might be able to survive (including by destroying everything else potentially dangerous to its survival) and eventually get around to computing your decision and affecting the world with it.
When discussing the idea of a program implementing what you on reflection would do, I think we had different ideas in mind. What I meant was that every action the AI would take would be its best approximation of what you-on-reflection would want. This doesn’t sound dangerous to me. I think that approval-based AI and iterated amplification with HCH would be two ways of making approximations to the output of you-on-reflection. And I don’t think they’re unworkably dangerous.
If the AI is instead allowed to take arbitrarily many unaligned actions before taking the actions you’d recommend, then you are right in that that would be very dangerous. I think this was the idea you had in mind, but feel free to correct me.
If we did misunderstand each other, I apologize. If not, then is there something I’m missing? I would think that a program that faithfully outputs some approximation of “what I’d want on reflection” on every action it takes would not perform devastatingly badly. I on reflection wouldn’t want the world destroyed, so I don’t think it would take actions that would destroy it.
I’ve made a few posts that seemed to contain potentially valuable ideas related to AI safety. However, I got almost no feedback on them, so I was hoping some people could look at them and tell me what they think. They still seem valid to me, and if they are, they could potentially be very valuable contributions. And if they aren’t valid, then I think knowing the reason for this could potentially help me a lot in my future efforts towards contributing to AI safety.
The posts are:
My critique of a published impact measure.
Manual alignment
Alignment via reverse engineering
I made a new article about defining “optimizer”. I was wondering if someone could look over it and tell me what they think before I post it on Less Wrong. You can find it here.
There is a matter I’m confused about: What exactly is base-level reality, does it necessarily exist, and is it ontologically different from other constructs?
First off, I had gotten the impression that there was a base-level reality, and that in some sense it’s ontologically different from the sorts of abstractions we use in our models. I thought that, in some sense, the subatomic particles “actually” existed, whereas our abstractions, like chairs, were “just” abstractions. I’m not actually sure how I got this impression, but I had the sense that other people thought this way, too.
And indeed, you could adopt an epistemology that would imply this. But I’m not sure what the benefit of doing so would be. Suppose people discovered lower-level particles that composed quantum particles, and modeling using these lower-level particles would provide higher predictive accuracy than using mere quantum physics. But then suppose people discover sub-sub-quantum particles and that modeling the world in terms of these sub-sub-particles further yielded a more accurate world model than just modeling with sub-quantum particles. And what if this process continued forever: people just kept finding lower-level particles that composed higher-level particles and had higher predictive accuracy.
In the above situation, what’s supposed to be taken to be base-level reality? Now, if you wanted, you could imagine that the world actually does have a base-level reality in the form of an infinite-memory computer, and that this computer dynamically generates new abstractions and uses them to compute what the agents see, making sure that it manages to start simulating things at a lower level of abstraction before any agent could reach the current “base-level” reality.
But that doesn’t seem like a very natural hypothesis. If you keep finding more and more decompositions forever, it really seems to me that “there’s no base-level reality” would be a simpler and more natural hypothesis.
Distinguishing the physical world from mathematical entities is pragmatic: it reflects how the physical world relates to you. It’s impossible to fully know what the physical world is, but it’s possible to interact with it (and to care about what happens in it), and these interactions depend on what it is. When reasoning about constructed mathematical entities, you get to know what you are working with, but not in the case of the physical world. So we can similarly consider an agent living in a different mathematical entity, and for that agent that mathematical entity would be their real physical world.
Because we have to deal with the presence of the real world, it might be convenient to develop concepts that don’t presume knowledge of its nature, which should apply to mathematical entities if we forget (in some perspective) what they are. It’s also relevant to recall that the idea of a “mathematical entity” is informal, so strictly speaking it doesn’t make sense to claim that the physical world is “a mathematical entity”, because we can’t carefully say what exactly “a mathematical entity” is in general, there are only more specific examples that we don’t know the physical world to be one of.
Reality is that which actually exists, regardless of how any agents within it might perceive it, choose to model it, or describe it to each other.
If reality happens to be infinitely complex, then all finite models of it must necessarily be incomplete. That might be annoying, but why would you consider that to mean “reality doesn’t really exist”?
Well, to be clear, I didn’t intend to say that reality doesn’t really exist. There’s definitely something that’s real. I was just wondering whether there is some base-level reality that’s ontologically different from other things, like the abstractions we use.
Now, what I’m saying feels pretty philosophical, and perhaps the question isn’t even meaningful.
Still, I’m wondering about the agents making an infinite sequence of decompositions that each have increased predictive accuracy. What would the base-level reality be in that case? Any of the decompositions the agents create would be wrong, even if some are infinitely complex.
Also, I’ve realized I’m confused about the meaning of “what really exists”, but I think it would be hard to clarify and reason about this. Perhaps I’m overthinking things, but I am still rather confused.
If I imagine some other agent or AI that doesn’t distinguish between base-level reality and abstractions, I’m not sure how I could argue with them. I mean, in principle, I think you could come up with reasoning systems that distinguish between base-level reality and abstractions, as well as reasoning systems that don’t, that both make equally good empirical predictions. If there was some alien that didn’t make the distinction in their epistemology or ontology, I’m not sure how I could say, and support saying, “You’re wrong”.
I mean, I predict you could make arbitrarily powerful agents with high predictive accuracy and high optimization-pressure that don’t distinguish between base-level reality and abstractions, and could do the same with agents that do make such a distinction. If both perform fine, then I’m not sure how I could argue that one’s “wrong”.
Is the existence of base-level reality subjective? Does this question even make sense?
We are probably just using different terminology and talking past each other. You agree that there is “something that’s real”. From my point of view, the term “base-level reality” refers to exactly that which is real, and no more. The abstractions we use do not necessarily correspond with base-level reality in any way at all. In particular if we are any of simulated entities, dreaming, full-sensory hallucinating, disembodied consciousness, or brains in jars with synthetic sensory input then we may not have any way to learn anything meaningful about base-level reality, but that does not preclude its existence because it is still certain that something exists.
None of the models are any sort of reality at all. At best, they are predictors of some sort of sensory reality (which may be base-level reality, or might not). It is possible that all of the models are actually completely wrong, as the agents have all been living in a simulation or are actually insane with false memories of correct predictions, etc.
The question makes sense, but the answer is the most emphatic NO that it is possible to give. Even in some hypothetical solipsistic universe in which only one bodiless mind exists and anything else is just internal experiences of that mind, that mind objectively exists.
It is conceivable to suppose a universe in which everything is a simulation in some lower-level universe resulting in an ordering with no least element to qualify as base-level reality, but this is still an objective fact about such a universe.
We do seem to have been talking past each other to some extent. Base-level reality, of course, exists if you define it to be “what really exists”.
However, I’m a little unsure about whether that’s how people use the word. I mean, if someone asked me if Santa really exists, I’d say “No”, but if they asked if chairs really existed, I’d say “Yes”. That doesn’t seem wrong to me, but I thought our base-level reality only contained subatomic particles, not chairs. Does this mean the statement “Chairs really exist” is actually wrong? Or am I misinterpreting?
I’m also wondering how people justify thinking that models talking about things like chairs, trees, and anything other than subatomic particles don’t “really exist”. Is this even true?
I’m just imagining talking with some aliens with no distinction between base-level reality and what we would consider mere abstractions. For example, suppose the aliens knew about chairs, and when they discovered quantum theory, they said, “Oh! There are these atom things, and when they’re arranged in the right way, they cause chairs to exist!” But suppose they never distinguished between the subatomic particles being real and the chairs being real: they just saw subatomic particles and chairs as both fully real, with the correct arrangement of the former causing the latter to exist.
How could I argue with such aliens? They’re already making correct predictions, so I don’t see any way to show them evidence that disproves them. Is there some abstract reason to think models about things like chairs don’t “really exist”?
The main places I’ve seen the term “base-level reality” used are in discussions about the simulation hypothesis. “Base-level” being the actually real reality where sensory information tells you about interactions in the actual real world, as opposed to simulations where the sensory information is fabricated and almost completely independent of the rules that base-level reality follows. The abstraction is that the base-level reality serves as a foundation on which (potentially) a whole “tower” of simulations-within-simulations-within-simulations could be erected.
That semantic excursion aside, you don’t need to go to aliens to find beings that hold subatomic particles as being ontologically equivalent with chairs. Plenty of people hold that they’re both abstractions that help us deal with the world we live in, just at different length scales (and I’m one of them).
Well, even in a simulation, sensory information still tells you about interactions in the actual real world. I mean, based on your experiences in the simulation, you can potentially approximately infer the algorithm and computational state of the “base-level” computer you’re running in, and I believe those count as interactions in the “actual real world”. And if your simulation is sufficiently big and takes up a sufficiently large amount of the world, you could potentially learn quite a lot about the underlying “real world” just by examining your simulation.
That said, I still can’t say I really understand the concept of “base-level reality”. I know you said it’s what informs you about the “actual real world”, but this feels similarly confusing to me as defining base-level reality as “what really exists”. I know that reasoning and talking about things so abstract is hard and can easily lead to nonsense, but I’m still interested.
I’m curious about what the purpose even is of having an ontologically fundamental distinction between base-level reality and abstractions, and whether it’s worth having. When asking, “Should I treat base-level reality and abstractions as fundamentally distinct?”, I think a good way to approximate this is by asking “Would I want an AI to reason as if its abstractions and base-level reality were fundamentally distinct?”
And I’m not completely sure they should. AIs, to reason practically, need to use “abstractions” in at least some of their models. If you want, you could have a special “This is just an abstraction” or “this is base-level reality” tag on each of your models, but I’m not sure what the benefit of this would be or what you would use it for.
Even without such a distinction, an AI would have both models that would be normally considered abstractions, as well as those of what you would think of as base-level reality, and would select which models to use based on their computational efficiency and the extent to which they are relevant and informative to the topic at hand. That sounds like a reasonable thing to do, and I’m not clear how ascribing fundamental difference to “abstractions” and “base-level” reality would do better than this.
If the AI talks with humans that use the phrase “base-level reality”, then it could potentially be useful for the AI to come up with an is-base-level-reality predicate in its world model in order to model things that answer, “When will this person call something base-level reality?” But such a predicate wouldn’t be treated as fundamentally different from any other predicate, like “Is a chair”.
Do you want an AI to be able to conceive of anything along the lines of “how correct is my model”, to distinguish hypothetical from actual, or illusion from substance?
If you do, then you want something that fits in the conceptual space pointed at by “base-level reality”, even if it doesn’t use that phrase or even have the capability to express it.
I suppose it might be possible to have a functioning AI that is capable of reasoning and forming models without being able to make any such distinctions, but I can’t see a way to do it that won’t be fundamentally crippled compared with human capability.
I’m interested in your thoughts on how the AI would be crippled.
I don’t think it would be crippled in terms of empirical predictive accuracy, at least. The AI could still come up with all the low-level models like quantum physics, as well as keep the abstract ones like “this is what a chair is”, and then just use whichever it needs to achieve the highest possible predictive accuracy in a given circumstance.
If the AI is built to make and run quantum physics experiments, then in order to have high predictive accuracy it would need to learn and use an accurate model of quantum physics. But I don’t see why you would need a distinction between base-level reality and abstractions to do that.
The AI could still learn a sense of “illusion”. If the AI is around psychotic people who have illusions a lot, then I don’t see what’s stopping the AI from forming a model saying, “Some people experience these things called ‘illusions’, and it makes them take the wrong actions or make the wrong predictions, as specified in <insert model of how people react to illusions>”.
And I don’t see why the AI wouldn’t be able to consider the possibility that it also experiences illusions. For example, suppose the AI is in the desert and keeps seeing what looks like an oasis. But when the AI gets closer, it sees only sand. To have higher predictive accuracy in this situation, the AI could learn a (non-ontologically fundamental) “is-an-illusion” predicate.
Would the crippling be in terms of scoring highly on its utility function, rather than just predicting percepts? I don’t really see how this would be a problem. I mean, suppose you want an AI to make chairs. Then even if the AI lacked a notion of base-level reality, it could still learn accurate models of how chairs work and how they are manufactured. Then the AI could have its utility function defined in terms of its notion of chairs to make it make chairs.
Could you give any specific example in which an AI using no ontologically fundamental notion of base-level reality would either make the wrong prediction or make the wrong action, in a way that would be avoided by using such a notion?
This feels like a bait-and-switch since you’re now talking about this in terms of an “ontologically fundamental” qualifier where previously you were only talking about “ontologically different”.
To you, does the phrase “ontologically fundamental” mean exactly the same thing as “ontologically different”? It certainly doesn’t to me!
It was a mistake for me to conflate “ontologically fundamental” and “ontologically different”.
Still, I had in mind that they were ontologically different in some fundamental way. It was my mistake to merely use the word “different”. I had imagined that to make an AI that’s reasonable, it would actually make sense to hard-code some notion of base-level reality as well as abstractions, and to treat them differently. For example, you could have the AI have a single prior over “base-level reality”, then just come up with whatever abstractions work well for predictively approximating the base-level reality. Instead it seems like the AI could just learn the concept of “base-level reality” like it would learn any other concept. Is this correct?
Also, in the examples I gave, I think the AI wouldn’t actually have needed a notion of base-level reality. The concept of a mirage is different from the concept of non-base-level reality. So is the concept of a mental illusion. Understanding both of those is different than understanding the concept of base-level reality.
If humans use the phrase “base-level reality”, I still don’t think it would be strictly necessary for an AI to have the concept. The AI could just know rules of the form, “If you ask a human if x is base-level reality, they will say ‘yes’ in the following situations...”, and then describe the situations.
So it doesn’t seem to me like the actual concept of “base-level reality” is essential, though it might be helpful. Of course, I might be missing or misunderstanding something. Corrections are appreciated.
Different in a narrow sense yes. “Refraction through heated air that can mislead a viewer into thinking it is reflection from water” is indeed different from “lifetime sensory perceptions that mislead about the true nature and behaviour of reality”. However, my opinion is that any intelligence that can conceive of the first without being able to conceive of the second is crippled by comparison with the range of human thought.
I don’t think you would actually need a concept of base-level reality to conceive of this.
First off, let me say that it seems pretty hard to come up with lifetime sensory percepts that would mislead about reality. Even if the AI was in a simulation, the physical implementation is part of reality. And the AI could learn about it. And from this, the AI could also potentially learn about the world outside the simulation. AIs commonly try to come up with the simplest (in terms of description length), most predictively accurate model of their percepts they can. And I bet the simplest models would involve having a world outside the simulation with specified physics, that would result in the simulations being built.
That said, lifetime sensory percepts can still mislead. For example, the simplest, highest-prior models that explain the AI’s percepts might say it’s in a simulation run by aliens. However, suppose the AI’s simulation actually just poofed into existence without a cause, and the rest of the world is filled with giant hats and no aliens. An AI, even without a distinction between base-level reality and abstractions, would still be able to come up with this model. If this isn’t a model involving percepts misleading you about the nature of reality, I’m not sure what is. So it seems to me that such AIs would be able to conceive of the idea of percepts misleading about reality. And the AIs would assign low probability to being in the all-hat world, just as they should.
The only means would be errors in the simulation.
Any underlying reality that supports Turing machines or any of the many equivalents can simulate every computable process. Even in the case of computers with bounded resources, there are corresponding theorems that show that the process being computed does not depend upon the underlying computing model.
So the only thing that can be discerned is that the underlying reality supports computation, and says essentially nothing about the form that it takes.
How can it conceive of the idea of percepts misleading about reality if it literally can’t conceive of any distinction between models (which are a special case of abstractions) and reality?
Well, the only absolute guarantee the AI can make is that the underlying reality supports computation.
But it can still probabilistically infer other things about it. Specifically, the AI knows not only that the underlying reality supports computation, but also that there was some underlying process that actually created the simulation it’s in. Even though Conway’s Game of Life can allow for arbitrary computation, many possible configurations of the world state would result in no AI simulations being made. The configurations that would result in AI simulations being made would likely involve some sort of intelligent civilization creating the simulations. So the AI could potentially predict the existence of this civilization and infer some things about it.
Regardless, even if the AI can’t infer anything else about outside reality, I don’t see how this is a fault of not having a notion of base-level reality. I mean, if you’re correct, then it’s not clear to me how an AI with a notion of base-level reality would do inferentially better.
I know we’ve been going back and forth a lot, but I think these are pretty interesting things to talk about, so I thank you for the discussion.
It might help if you try to describe a specific situation in which the AI makes the wrong prediction or takes the wrong action for its goals. This could help me better understand what you’re thinking about.
At this point I’m not sure there’s much point in discussing further. You’re using words in ways that seem self-contradictory to me.
You said “the AI could still consider the possibility that the world is composed of [...]”. Considering a possibility is creating a model. Models can be constructed about all sorts of things: mathematical statements, future sensory inputs, hypothetical AIs in simulated worlds, and so on. In this case, the AI’s model is about “the world”, that is to say, reality.
So it is using a concept of model, and a concept of reality. It is only considering the model as a possibility, so it knows that not everything true in the model is automatically true in reality and vice versa. Therefore it is distinguishing between them. But you posited that it can’t do that.
To me, this is a blatant contradiction. My model of you is that you are unlikely to post blatant contradictions, so I am left with the likelihood that what you mean by your statements is wholly unlike the meaning I assign to the same statements. This does not bode well for effective communication.
Yeah, it might be best to wrap up the discussion. It seems we aren’t really understanding what the other means.
Well, I can’t say I’m really following you there. The AI would still have a notion of reality. It just would consider abstractions like chairs and tables to be part of reality.
There is one thing I want to say though. We’ve been discussing the question of if a notion of base-level reality is necessary to avoid severe limitations in reasoning ability. And to see why I think it’s not, just consider regular humans. They often don’t have a distinction between base-level reality and abstractions. And yet, they can still reason about the possibility of life-long illusions as well as function well to accomplish their goals. And if you taught someone the concept of “base-level reality”, I’m not sure it would help them much.
It sounds like you’re using very different expectations for those questions, as opposed to the very rigorous interrogation of base reality. ‘Does Santa exist?’ and ‘does that chair exist?’ are questions which (implicitly, at least) are part of a system of questions like ‘what happens if I set trip mines in my chimney tonight?’ and ‘if I try to sit down, will I fall on my ass?’ which have consequences in terms of sensory input and feedback. You can respond ‘yes’ to the former, if you’re trying to preserve a child’s belief in Santa (although I contend that’s a lie) and you can truthfully answer ‘no’ to the latter if you want to talk about an investigation of base reality.
Of course, if you answer ‘no’ to ‘does that chair exist?’ your interlocutor will give you a contemptuous look, because that wasn’t the question they were asking, and you knew that, and you chose to answer a different question anyway.
I choose to think of this as different levels of resolution, or as varying bucket widths on a histogram. To the question ‘does Jupiter orbit the Sun?’ you can productively answer ‘yes’ if you’re giving an elementary school class a basic lesson on the structure of the solar system. But if you’re trying to slingshot a satellite around Ganymede, the answer is going to be no, because the Solar-Jovian barycenter lies just outside the Sun’s surface, and at the level you’re operating, that’s actually relevant.
Most people don’t use the words ‘reality’ or ‘exist’ in the way we’re using them here, not because people are idiots, but because they don’t have a coherent existential base for non-idiocy, and because it’s hard to justify the importance of those questions when you spend your whole life in sensory reality.
As to the aliens, well, if they don’t distinguish between base level reality and abstractions, they can make plenty of good sensory predictions in day-to-day life, but they may run into some issues trying to make predictions in high-energy physics. If they manage to do both well, it sounds like they’re doing a good job operating across multiple levels of resolution. I confess I don’t have a strong grasp on the subject, or on the differences between a model being real versus not being real in terms of base reality; I’m gonna wait on JBlack’s response to that.
Relevant links (which you’ve probably already read):
How an Algorithm Feels From the Inside, Eliezer Yudkowsky
The Categories Were Made for Man, not Man for the Categories, Scott Alexander
Ontological Remodeling, David Chapman
The correctness of that post has been disputed; for an extended rebuttal, see “Where to Draw the Boundaries?” and “Unnatural Categories Are Optimized for Deception”.
Thanks Zack!
I generally agree with the content of the articles you linked, and that there are different notions of “really exist”. The issue is, I’m still not sure what “base-level reality” means. JBlack said it was what “really exists”, but since JBlack seems to be using a notion of “what really exists” that’s different from the one people normally use, I’m not really sure what it means.
In the end, you can choose to define “what really exists” or “base-level reality” however you want, but I’m still wondering about what people normally take them to mean.
I try to avoid using the word ‘really’ for this sort of reason. Gets you into all sorts of trouble.
(a) JBlack is using a definition related to simulation theory, and I don’t know enough about this to speculate too much, but it seems to rely on a hard discontinuity between base and sensory reality.
(b) Before I realized he was using it that way, I thought the phrase meant ‘reality as expressed on the most basic level yet conceivable’ which, if it is possible to understand it, explodes the abstractions of higher orders and possibly results in their dissolving into absurdity. This is a softer transition than the above.
(c) I figure most people use ‘really exist’ to refer to material sensory reality as opposed to ideas. This chair exists, the Platonic Idea of a chair does not. The rule with this sort of assumption is ‘if I can touch it, or it can touch me, it exists’ for a suitably broad understanding of ‘touch.’
(d) I’ve heard some people claim that the only things that ‘really exist’ are those you can prove with mathematics or deduction, and mere material reality is a frivolity.
(e) I know some religious people believe heavily in the primacy of God (or whichever concept you want to insert here) and regard the material world as illusory, and that the afterlife is the ‘true’ world. You can see this idea everywhere from the Kalachakra mandala to the last chapter of The Screwtape Letters.
I guess the one thing uniting all these is that, if it were possible to take a true Outside View, this is what you would see; a Platonic World of ideas, or a purely material universe, or a marble held in the palm of God, or a mass of vibrating strings (or whatever the cool kids in quantum physics are thinking these days) or a huge simulation of any of the above instantiated on any of the above.
I think most people think in terms of option c, because it fits really easily into a modern materialist worldview, but the prevalence of e shouldn’t be downplayed. I’ve probably missed some important ones.
I had made a post proposing a new alignment technique. I didn’t get any responses, but it still seems like a reasonable idea to me, so I’m interested in hearing what others think of it. I think the basic idea of the post, if correct, could be useful for future study. However, I don’t want to waste time doing this if the idea is unworkable for a reason I hadn’t thought of.
(If you’re interested, please read the post before reading below.)
Of course, the idea’s not a complete solution to alignment, and things have a risk of going catastrophically wrong due to other problems, like unreliable reasoning. But it still seems to me that it’s potentially helpful for outer alignment and corrigibility.
If the humans actually directly answer any query about the desirability of an outcome, then it’s hard for me to see a way this system wouldn’t be outer-aligned.
Now, consulting humans every time results in a very slow objective function. Most optimization algorithms I know of rely on huge numbers of queries to the objective function, so using these algorithms with humans manually implementing the objective function would be infeasible. However, I don’t see anything in principle impossible with coming up with an optimization algorithm that scores well on its objective function even if that function is extremely slow. Even if the technique I described in the post to do this was wrong, I haven’t seen anyone looking into this, so it doesn’t seem clearly unworkable to me.
Even if this does turn out to be intractable, I think the basic motivation of my post still has the potential to be useful. The main motivation of my post is to have the AI use a hard-coded method of querying humans before making major strategic decisions and update its beliefs about what is desirable based on their responses. But that is a technique that could be used in other AI systems as well. It wouldn’t solve everything, of course, but it could provide an additional level of safety. I’m not sure if this idea has been discussed before.
I also have yet to find anything seriously problematic about the method I provided to optimize with limited calls to the objective function. There could of course be some problems I haven’t thought of, though.
I found what seems to be a potentially dangerous false-negative in the most popular definition of optimizer. I didn’t get a response, so I would appreciate feedback on whether it’s reasonable. I’ve been focusing on defining “optimizer”, so I think feedback would help me a lot. You can see my comment here.
I’ve realized I’m somewhat skeptical of the simulation argument.
The simulation argument proposed by Bostrom argued, roughly, that either almost exactly all Earth-like worlds don’t reach a posthuman level, almost exactly all such civilizations don’t go on to build many simulations, or that we’re almost certainly in a simulation.
Now, if we knew that the only two sorts of creatures that experience what we experience are either in simulations or the actual, original, non-simulated Earth, then I can see why the argument would be reasonable. However, I don’t know how we could know this.
For example, consider zoos: Perhaps advanced aliens create “zoos” featuring humans in an Earth-like world, for their own entertainment or other purposes. These wouldn’t necessarily be simulations of any actual other planet, but might merely have been inspired by actual planets. Similarly, lions in the zoo are similar to lions in the wild, and their enclosure features plants and other environmental features similar to what they would experience in the wild. But I wouldn’t call lions in zoos simulations of wild lions, even if the developed parts where humans could view them were completely invisible to them and their enclosure was arbitrarily large.
Similarly, consider games: Perhaps aliens create games or something like them set in Earth-like worlds that aren’t actually intended to be simulations of any particular world. Similarly, human fantasy RPGs often have a medieval theme, so maybe aliens would create games set in a modern-Earth-like world, without having in mind any actual planet to simulate.
Now, you could argue that in an infinite universe, these things are all actually simulations, because there must be some actual, non-simulated world that’s just like the “zoo” or game. However, by that reasoning, you could argue that a rock you pick up is nothing but a “rock simulation” because you know there is at least one other rock in the universe with the exact same configuration and environment as the rock you’re holding. That doesn’t seem right to me.
Similarly, you could say, then, that I’m actually in a simulation right now. Because even if I’m in the original Earth, there is some other Chantiel in the universe in a situation identical to my current one, who is logically constrained to do the same thing I do, so thus I am a simulation of her. And my environment is thus a simulation of hers.
I think you should reread the paper.
This falls under either #1 or #2, since you don’t say what human capabilities are in the zoo or explain how exactly this zoo situation matters to running simulations; do we go extinct at some time long in the future when our zookeepers stop keeping us alive (and “go extinct before reaching a “posthuman” stage”), having never become powerful zookeeper-level civs ourselves, or are we not permitted to (“extremely unlikely to run a significant number of simulations”)?
This is just fork #3: “we are in a simulation”. At no point does fork #3 require it to be an exact true perfect-fidelity simulation of an actual past, and he is explicit that the minds in the simulation may be only tenuously related to ‘real’/historical minds; if aliens would be likely to create Earth-like worlds, for any reason, that’s fine because that’s what’s necessary, because we observe an Earth-like world (see the indifference principle section).
Thanks for the response, Gwern.
Oh, I guess I missed this. Do you know where Bostrom said the “simulations” can be only tenuously related to real minds? I was rereading the paper but didn’t see mention of this. I’m just surprised, because normally I don’t think zoo-like things would be considered simulations.
In case I didn’t make it clear, I’m saying that even if a significant proportion of civilizations reach a post-human stage and a significant proportion of these run simulations, there would still potentially be a non-small chance of actually not being in a simulation and instead being in a game or zoo. For example, suppose each post-human civilization makes 100 proper simulations and 100 zoos. Then even if parts 1 and 2 of the simulation argument are true, you still have a 50% chance of ending up in a zoo.
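To make the arithmetic in that example concrete, here is a minimal sketch. The function name `share_in_zoos` is hypothetical, and the sketch assumes that every created world holds the same number of observers and that the original, uncreated worlds are few enough to ignore; neither assumption comes from Bostrom’s paper.

```python
from fractions import Fraction


def share_in_zoos(simulations_per_civ: int, zoos_per_civ: int) -> Fraction:
    """Fraction of created Earth-like worlds that are zoos, not simulations.

    Assumption: each created world holds the same number of observers, and
    the original (uncreated) worlds are ignored.
    """
    created = simulations_per_civ + zoos_per_civ
    if created <= 0:
        raise ValueError("each civilization must create at least one world")
    return Fraction(zoos_per_civ, created)


# The example from the comment: 100 proper simulations and 100 zoos.
print(share_in_zoos(100, 100))  # 1/2
```

Changing the ratio of zoos to simulations changes the result proportionally, so the conclusion depends entirely on how common zoo-like worlds are relative to proper simulations.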
Does this make sense?
[edited]
By “real”, do you mean non-simulated? Are you saying that even if 99% of Chantiels in the universe are in simulations, then I should still believe I’m not in one? I don’t know how I could convince myself of being “real” if 99% of Chantiels aren’t.
Do you perhaps mean I should act as if I were non-simulated, rather than literally being non-simulated?
[edited]
Interesting. When you say “fake” versions of myself, do you mean simulations? If so, I’m having a hard time seeing how that could be true. Specifically, what’s wrong about me thinking I might not be “real”? I mean, if I thought I was in a simulation, I think I’d do pretty much the same things I would do if I thought I wasn’t in a simulation. So I’m not sure what the moral harm is.
Do you have any links to previous discussions about this?
Interesting.
I am also skeptical of the simulation argument, but for different reasons.
My main issue is: the normal simulation argument requires violating the Margolus–Levitin theorem[1], as it requires that you can do an arbitrary amount of computation[2] via recursively simulating[3].
This either means that the Margolus–Levitin theorem is false in our universe (which would be interesting), we’re a ‘leaf’ simulation where the Margolus–Levitin theorem holds, but there’s many universes where it does not (which would also be interesting), or we have a non-zero chance of not being in a simulation.
This is essentially a justification for ‘almost exactly all such civilizations don’t go on to build many simulations’.
[1] A fundamental limit on computation: ≤ 6 × 10^33 operations/second/Joule.
[2] Note: I’m using ‘amount of computation’ as shorthand for ‘operations / second / Joule’. This is a little bit different than normal, but meh.
[3] Call C the scaling factor: the amount of computation you can simulate per unit of computation spent simulating it. So e.g. C=0.5 means that to simulate 1 unit of computation you need 2 units of computation. If C≥1, then you can violate the Margolus–Levitin theorem simply by recursively sub-simulating far enough. If C<1, then a universe that can do X computation can simulate no more than CX total computation regardless of how deep the tree is, in which case there’s at least a 1−C chance that we’re in the ‘real’ universe.
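One way to read the 1−C figure is as a geometric series. The sketch below is an illustration under stated assumptions, not the commenter’s own calculation: the real universe does 1 unit of computation, level k of nesting gets C^k units, and the chance of being at a given level is taken to be proportional to the computation done there. The name `real_share` is hypothetical.

```python
def real_share(c: float, depth: int) -> float:
    """Share of all computation that happens in the unsimulated universe.

    Assumptions: level 0 (the real universe) does 1 unit of computation,
    level k of nested simulation does c**k units, and nesting stops at
    `depth` levels.
    """
    if not 0 <= c < 1:
        raise ValueError("this sketch only covers 0 <= c < 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    total = sum(c ** k for k in range(depth + 1))
    return 1.0 / total


# With c = 0.5 the share falls toward 1 - c as nesting gets deeper.
for depth in (0, 1, 5, 50):
    print(depth, real_share(0.5, depth))
```

As depth grows, the total approaches 1/(1−C), so the real universe’s share approaches 1−C from above, which matches the “at least 1−C” wording. Dropping the proportionality assumption, for instance by letting simulations hold more observers per unit of computation, changes the result.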
No, it doesn’t, any more than “Godel’s theorem” or “Turing’s proof” proves simulations are impossible or “problems are NP-hard and so AGI is impossible”.
There are countless ways to evade this impossibility argument, several of which are already discussed in Bostrom’s paper (I think you should reread the paper) eg. simulators can simply approximate, simulate smaller sections, tamper with observers inside the simulation, slow down the simulation, cache results like HashLife, and so on. (How do we simulate anything already...?)
All your Margolus-Levitin handwaving can do is disprove a strawman simulation along the lines of a maximally dumb pessimal 1:1 exact simulation of everything with identical numbers of observers at every level.
I should probably reread the paper.
That being said:
I don’t follow your logic here, which probably means I’m missing something. I agree that your latter cases are invalid logic. I don’t see why that’s relevant.
This does not evade this argument. If nested simulations successively approximate, total computation decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
This does not evade this argument. If nested simulations successively simulate smaller sections, total computation decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
This does not evade this argument. If nested simulations successively tamper with observers, this does not affect total computation—total computation still decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
This does not evade this argument. If nested simulations successively slow down, total computation[1] decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
This does not evade this argument. Using HashLife, total computation still decreases exponentially (or the Margolus–Levitin theorem doesn’t apply everywhere).
By accepting a multiplicative slowdown per level of simulation in the infinite limit[2], and not infinitely nesting.
See note 2 in the parent: “Note: I’m using ‘amount of computation’ as shorthand for ‘operations / second / Joule’. This is a little bit different than normal, but meh.”
You absolutely can, in certain cases, get no slowdown or even a speedup by doing a finite number of levels of simulation. However, this does not work in the limit.
No, it evades the argument by showing that what you take as a refutation of simulations is entirely compatible with simulations. Many impossibility proofs prove an X where people want it to prove a Y, and the X merely superficially resembles a Y.
No, it evades the argument by showing that what you take as a refutation of simulations is entirely compatible with simulations. Many impossibility proofs prove an X where people want it to prove a Y, and the X merely superficially resembles a Y.
No, it...
No, it...
No, it...
Reminder: you claimed:
The simulation argument does not require violating the M-L theorem to the extent it is superficially relevant and resembles an impossibility proof of simulations.
Are you saying that we can’t be in a simulation because our descendants might go on to build a large number of simulations themselves, requiring too many resources in the base reality? But I don’t think that weakens the argument very much, because we aren’t currently in a position to run a large number of simulations. Whoever is simulating us can just turn off/reset the simulation before that happens.
Said argument applies if we cannot recursively self-simulate, regardless of reason (Margolus–Levitin theorem, parent turning the simulation off or resetting it before we could, etc).
In order for ‘almost all’ computation to be simulated, most simulations have to be recursively self-simulating. So either we can recursively self-simulate (which would be interesting), we’re rare (which would also be interesting), or we have a non-zero chance we’re in the ‘real’ universe.
The argument is not that generic computations are likely simulated, it’s about our specific situation—being a newly intelligent species arising in an empty universe. So simulationists would take the ‘rare’ branch of your trilemma.
Interesting.
If you’re stating that generic intelligence was not likely simulated, but generic intelligence in our situation was likely simulated...
Doesn’t that fall afoul of the mediocrity principle applied to generic intelligence overall?
(As an aside, this does somewhat conflate ‘intelligence’ and ‘computation’; I am assuming that intelligence requires at least some non-zero amount of computation. It’s good to make this assumption explicit I suppose.)
Sure. I just think we have enough evidence to overrule the principle, in the form of sensory experiences apparently belonging to a member of a newly-arisen intelligent species. Overruling mediocrity principles with evidence is common.
I had recently posted a question asking whether iterated amplification was actually more powerful than mere mimicry and arguing that it was not. I had thought I was making a pretty significant point, but the post attracted very little attention. I’m not saying this is a bad thing, but I’m not really sure why it happened, so I would appreciate some insight about how I can contribute more usefully.
Iterated amplification seems to be the leading proposal for creating aligned AI, so I thought a post arguing against it, if correct, would be a useful contribution. Perhaps there is some mistake in my reasoning, but I have yet to see any mentioned. It’s possible that people have already thought of this consideration and posted about it, but I have yet to find any such posts, so I’m not really sure.
Would it have been better posting it as an actual post instead of framing it as a question? I have some more to say to argue for mimicry than I mentioned in the question; would it be worthwhile for me to add it and then post this as a non-question post?
It’s true that most problems could be delegated to uploads, and any specific design is a design that the uploads could come up with just as well or better. The issue is that we don’t have uploads, and most plans to get them before AGI involve the kind of hypothetical AI know-how that might easily be used to build an agentic AGI, the risk the uploads are supposed to resolve.
Thus the “humans” of a realistic implementation of HCH are expected to be vague imitations of humans that only function somewhat sensibly in familiar situations and for a short time, not fully functional uploads, and most of the point of the specific designs is to mitigate the imperfection of their initial form, to make something safe/useful out of this plausibly feasible ingredient. One of the contentious points about this is whether it’s actually possible to build something useful (let alone safe) out of such imperfect imitations, even if we build a large system out of them that uses an implausible amount of resources. This is what happens with an HCH that can use an infinite number of actual uploads (exact imitations) that are still restricted to an hour or a day of thinking/learning (and then essentially get erased, that is, can’t make further use of the things they learned). Designing something safe/useful in the exact imitation HCH setting is an easier problem than doing so in a realistic setting, so it’s a good starting point.
Thanks for the response. To be clear, when discussing mimics, I did not have in mind perfect uploads of people. Instead, they could indeed be rather limited imitations. For example, an AI designing improvements to itself doesn’t need to actually have a generally faithful imitation of human behavior. Instead, it could just know a few things, like, “make this algorithm score better on this thing without taking over the world”.
Still, I can see how, when it comes to especially limited imitations, iterated amplification could be valuable. This seems especially true if the imitations are unreliable in even narrow situations. It would be problematic if an AI tasked with designing powerful AI didn’t get the “act corrigibly, and don’t take over the world” part reliably right.
I’ve been thinking about what you’ve said about iterated amplification, and there are some things I’m unsure of. I’m still rather skeptical of the benefit of iterated amplification, so I’d really appreciate a response.
You mentioned that iterated amplification can be useful when you have only very limited, domain-specific models of human behavior, where such models would be unable to come up with the ability to create code. However, there are two things I’m wondering about. The first is that it seems to me that, for a wide range of situations, you need a general and robustly accurate model of human behavior to perform well. The second is that, even if you don’t have a general model of human behavior, it seems to me that it’s sufficient to only have one amplification step, which I suppose isn’t iterated amplification. And the big benefit to avoiding iterated amplification is that iterated amplification results in exponential decreases in reliability from compounding errors on each distillation step, but with a single amplification step, this exponential decrease in reliability wouldn’t occur.
For the first topic, suppose your AI is trained to make movies. I think just about every human value is relevant to the creation of movies, because humans usually like movies with a happy ending, and to make an ending happy you need to understand what humans consider a “happy ending”.
Further, you would need an accurate model of human cognitive capabilities. To make a good movie, it needs to be easy enough for humans to understand. But sometimes it also shouldn’t be too easy, because that can remove the mystery of it.
And the above is not just true for movies: I think creating other forms of entertainment would involve the same things as above.
Could you do the above with only some domain-limited model of what counts as confusing or a good or bad ending in the context of movies? It’s not clear to me that this is possible. Movies involve a very wide variety of situations, and you need to keep things understandable and leading to a happy ending in all of those circumstances. I don’t see how you could robustly do the above without a general model of what people find confusing or otherwise bad.
Further, whenever an AI needs to explain something to humans, it seems to me that it’s important that it has an accurate model of what humans can understand and not understand. Is there any way to do this with purely domain-specific models rather than with a general understanding of what people find confusing? It’s not clear to me that this is possible. For example, imagine an AI that needs to explain many different things. Maybe it’s tasked with creating learning materials or making the news. With such a broad category of things the AI needs to explain, it’s really not clear to me how an AI could do this without a general model of what makes things confusing or not.
Also more generally, it seems to me that whenever the AI is involved with human interaction in novel circumstances, it will need an accurate model of what people like and dislike. For example, consider an AI tasked with coming up with a plan for human workers. Doing so has the potential to involve an extremely wide range of values. For example, humans generally value novelty, autonomy, not feeling embarrassed, not being bored, not being overly pressured, not feeling offended, and not seeing disgusting or ugly things.
Could you have an AI learn to avoid these things with only domain-specific models, rather than a general understanding of what people value and disvalue? I’m not sure how to do this. Maybe you could learn models that work for reflecting people’s values in limited circumstances. However, I think an essential component of intelligence is to come up with novel plans involving novel situations. And I don’t see how an agent could do this without a general understanding of values. For example, the AI might create entire new industries, and it would be important that any human workers in those industries would have satisfactory conditions.
Now, for the second topic: using amplification without iteration.
First off, I want to note that, even without a general model of humans, it’s still not really clear to me that you need any amplification at all. As I’ve said before, even mere human imitation has the potential to result in extremely high intelligence simply by doing the same things humans do, but much faster. As I mentioned previously, consider the human output to be published research papers from top researchers, and the AI is tasked with mimicking it. Then the AI could take the research papers as the human output and use this to create future papers, but far, far faster.
But suppose you do still need amplification. Then I don’t see why one amplification step wouldn’t be enough. I think that if you put together a sufficiently large number of intelligent humans and give them unlimited time to think, they’d be able to solve pretty much anything that iterated amplification with HCH would be able to solve. So, instead of having multiple amplification and distillation steps, you could instead just have one very large amplification step that would involve a large enough number of human models interacting that it could solve pretty much anything.
If the amplification step involves a sufficiently large number of people, you might be concerned that it would be intractable to emulate them all.
I’m not sure if this would be a problem. Consider again the AI designed to mimic the research papers of top researchers. I think that often a small number of top researchers are responsible for a large proportion of research progress, so the AI could potentially just see what the output of the top, say, 100 or 1000 researchers working together would be. And the AI would potentially be able to produce the outputs of each researcher with far less computation. That sounds plausibly like enough to me.
But suppose that’s not enough, and emulating every human individually during the amplification step is intractable. Then here’s how I think you can get around this: train not only a human model, but also a system for approximating the output of an expensive computation at much lower computational cost. Then, for the amplification step, you can define a computation involving an extremely large number of interacting emulated humans, and then allow the approximation system to come up with approximations to this without needing to directly emulate every human.
To give a sense of how this might work, note that in a computation, often a small number of the parts of the computation account for a large part of the output. For example, if you are trying to approximate a computation about gravity, commonly only the closest, most massive objects have a significant gravitational effect on something, and you can ignore the rest. Similarly, rather than simulate individual atoms, it’s much more efficient to come up with groups of large numbers of atoms, and consider their effect as a group. The same is true for other computations involving many small components.
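As a rough illustration of the gravity example, here is a minimal sketch that compares an exact sum over all bodies with an approximation that keeps only the few bodies with the largest mass / distance² and ignores the rest. The function names, the `keep` parameter, and the body values (very roughly Earth, the Moon, and two small distant masses) are all hypothetical choices of mine:

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2


def acceleration(point, bodies):
    """Exact gravitational acceleration at `point` from (mass, x, y) bodies."""
    ax = ay = 0.0
    for mass, x, y in bodies:
        dx, dy = x - point[0], y - point[1]
        r = math.hypot(dx, dy)
        a = G * mass / r**2
        ax += a * dx / r
        ay += a * dy / r
    return ax, ay


def approximate_acceleration(point, bodies, keep=2):
    """Same sum, but only over the `keep` bodies with the largest m / r^2."""
    def strength(body):
        mass, x, y = body
        return mass / ((x - point[0]) ** 2 + (y - point[1]) ** 2)

    strongest = sorted(bodies, key=strength, reverse=True)[:keep]
    return acceleration(point, strongest)


# Illustrative values only: (mass in kg, x in m, y in m).
bodies = [
    (6e24, 6.4e6, 0.0),   # huge and close: dominates the result
    (7e22, 3.8e8, 0.0),   # large but far
    (1e3, 1e9, 1e9),      # tiny and far: negligible
    (5e2, -2e9, 1e9),     # tiny and far: negligible
]
exact = acceleration((0.0, 0.0), bodies)
approx = approximate_acceleration((0.0, 0.0), bodies, keep=2)
print(exact, approx)
```

Here half of the bodies are skipped, yet the two results agree to many significant figures, because the skipped bodies contribute almost nothing.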
To emulate humans, you could potentially do the same things as you would when simulating gravity. Specifically, an AI may be able to consider groups of humans and infer what the final output of that group will be, without actually needing to emulate each one individually. Further, for very challenging topics, many people may fail to contribute anything to the final result, so the AI could potentially avoid emulating them at all.
So I still can’t really see the benefit of iterated amplification. Of course, I could be missing something, so I’m interested in hearing what you think.
One potential problem is that it might be hard to come up with good training data for an arbitrary-function approximator, since finding the exact output of expensive functions would be expensive. However, it’s not clear to me how big of a problem this would be. As I’ve said before, even the output of 100 or 1000 humans interacting could potentially be all the AI ever needs, and with sufficiently fast approximations of individual humans, it could be tractable to create training data for this.
Further, I bet the AI could learn a lot about arbitrary-function approximation just by training on approximating functions that are already reasonably fast to compute. I think the basic techniques for quickly approximating functions are what I mentioned before: come up with abstract objects that involve groups of individual components, and know when to stop performing the computation on a certain object because it’s clear it will have little effect on the final result.
Amplification induces a dynamic in the model space, it’s a concept of improving models (or equivalently in this context, distributions). This can be useful when you don’t have good datasets, in various ways.
For robustness, you have a dataset that’s drawn from the wrong distribution, and you need to act in a way that you would’ve acted if it was drawn from the correct distribution. If you have an amplification dynamic that moves models towards few attractors, then changing the starting point (training distribution compared to target distribution) probably won’t matter. At that point the issue is for the attractor to be useful with respect to all those starting distributions/models. This doesn’t automatically make sense, comparing models by usefulness doesn’t fall out of the other concepts.
For chess, you’d use the idea of winning games (better models are those that win more, thus amplification should move models towards winning), which is not inherent in any dataset of moves. For AGI, this is much more nebulous, but things like reflection (thinking about a problem longer, conferring with others, etc.) seem like a possible way of bootstrapping a relevant amplification, if goodharting is kept in check throughout the process.
Interesting. Do you have any links discussing this? I read Paul Christiano’s post on reliability amplification, but couldn’t find mention of this. And, alas, I’m having trouble finding other relevant articles online.
Yes, that’s true. I’m not claiming that iterated amplification doesn’t have advantages. What I’m wondering is if non-iterated amplification is a viable alternative. I haven’t seen non-iterated amplification proposed before for creating aligned AI. Amplification without iteration has the disadvantage that it may not have the attractor dynamic iterated amplification has, but it also doesn’t have the exponentially increasing unreliability iterated amplification has. So, to me at least, it’s not clear whether pursuing iterated amplification is a more promising strategy than amplification without iteration.
For me, the interesting thing about IDA is not capability amplification like self-play, but an attitude towards generation of datasets as a point of intervention into the workings of an AI for all kinds of improvements. So we have some AI that we want to make better in some respect, and the IDA methodology says that to do that, we should employ the AI to generate a dataset for retraining a new version of it that’s better than the original dataset in that respect. Then we retrain the AI using the new dataset. So amplification unpackages the AI into the form of an appropriately influenced dataset, and then learning repackages it for further use.
If the impact measure was poorly implemented, then I think such an impact-reducing AI could indeed result in the world turning out that way. However, note that the technique in the paper is intended to, for a very wide range of variables, make the world if the AI wasn’t turned on as similar as possible to what it would be like if it was turned on. So, you can potentially avoid the AI-controlled-drone scenario by including the variable “number of AI-controlled drones in the world” or something correlated with it, as these variables could have quite different values between a possible world in which the AI was turned on and a possible world in which the AI wasn’t.
Coming up with a set of variables wide enough to include that might seem a little difficult, but I’m not sure it would be. One option is to, for every definable function of the world, include the value of the function as one of the variables the AI considers and tries to avoid interfering with.
I have some concerns about an impact measure proposed here. I’m interested in working on impact measures, and these seem like very serious concerns to me, so it would be helpful to see what others think about them. I asked Stuart, one of the authors, about these concerns, but he said he was too busy to work on dealing with them.
First, I’ll give a basic description of the impact measure. Have your AI be turned on by some sort of stochastic process that may or may not result in the AI being turned on. For example, consider sending a photon through a semi-silvered mirror. Let X be the statement, “The AI was turned on”. For a wide range of variables V_i, try to make the probability distribution P(V_i | X) as similar as possible to the probability distribution P(V_i | ¬X).
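To make “as similar as possible” a bit more concrete, here is a minimal sketch that scores impact as the summed distance between P(V_i | X) and P(V_i | ¬X) over a set of discrete variables. Using total variation distance and a plain sum are my own assumptions for illustration; the paper may measure similarity differently, and the variable name and probabilities below are made up:

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


def impact(dists_given_on, dists_given_off):
    """Sum over variables of the distance between P(V_i | X) and P(V_i | not X)."""
    return sum(
        total_variation(dists_given_on[v], dists_given_off[v])
        for v in dists_given_on
    )


# Hypothetical variable: how many AI-controlled drones exist in the world.
given_on = {"drones": {"few": 0.25, "many": 0.75}}
given_off = {"drones": {"few": 0.75, "many": 0.25}}

score = impact(given_on, given_off)
print(score)
```

A score of 0 would mean every listed variable has the same distribution whether or not the AI was turned on; larger scores mean the AI’s activation is more visible in those variables.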
Here are my concerns:
It’s not clear to me that we can find any verifiably stochastic process. Some interpretations of quantum physics say that there are some latent variables that determine the outcome of seemingly stochastic processes. In this case, just knowing the AI was or was not turned on may thus have implications for latent variables that possibly have large effects on the world.
Further, I can’t find a way to define X and not-X that works, even if you find a truly stochastic process. You could try to specify X as, “the photon makes it through the semi-silvered mirror” and specify not-X as the negation of this. But conditioning on this and nothing else is insufficient. To see why, note that X and not-X don’t specify anything about the mechanics of how the photon and mirror interact. For example, X above doesn’t rule out universes in which there is no quantum physics as we know it, and the mirror and photon interact in some completely different way that makes the photon more likely to pass through than it would in our universe. Thus, conditioning on X would make it more likely the AI would be in that universe, which could have large effects on the values of variables.
You could try to get around this by also including in X and not-X mechanics describing how the photon and semi-silvered mirror interact. The issue is, we don’t actually know exactly how the photon and semi-silvered mirror interact. We have models of quantum physics that have extremely high predictive accuracy in practice, but we still don’t have a theory of everything. So we could include in X and not-X our current understanding of quantum physics, but that understanding might be somewhat wrong.
This could potentially be dangerous. Suppose the AI has invented some technology that, according to the actual, correct theory of everything, would destroy the world. But according to our slightly wrong model of quantum physics, it would do nothing in particular. Then the AI, having X and not-X condition on our current model of quantum physics, would see no problem with creating the technology.
Further still, I have the concern about what to do if the AI in fact doesn’t turn on. You could decide to just turn on some AI identical to it, and keep trying to turn on AIs until you get one to turn on. But then the world would pretty much be identical whether X or not-X turns out to be true, because the other AI would do the same thing it would have.
You could instead turn on some AI that uses the same reduced-impact technique, but is different from the current AI in some way that significantly affects the world. However, if you do this, then I think we might arrive in a very bad equilibrium in the AIs’ behavior. Specifically, suppose all AIs decide to destroy the world if they’re turned on. Then that means that if one of the AIs isn’t turned on, some other AI will be turned on, which would also lead to the world being destroyed. If all the destroyed worlds look pretty much the same, then the AI deciding to destroy the world would have low impact according to your definition.
So suppose instead that if the AI doesn’t turn on, we rely on some other AI safety technique. Suppose this AI safety technique successfully makes safe AI. In this case, we had no need to create the reduced-impact AI you described, because the other AI design would have worked fine. Suppose the alternative AI safety technique in fact doesn’t work, and the result of using it is that our world gets destroyed. In that case, the conditioning on not-X would imply the world would be destroyed. So if an AI tries to make the world conditioning on X similar to it conditioning on not-X, then it would again act to destroy the world.
Luke responded to one of my points, saying, “Yes. [what to do if the AI doesn’t get turned on] needs to be handled with care. A time-delay might be sufficient for this, though.” My response was:
I’ve come up with a system of infinite ethics intended to provide more reasonable moral recommendations than previously-proposed ones. I’m very interested in what people think of this, so comments are appreciated. I’ve made a write-up of it below.
One unsolved problem in ethics is that aggregate consequentialist ethical theories tend to break down if the universe is infinite. An infinite universe could contain both an infinite amount of good and an infinite amount of bad. If so, you are unable to change the total amount of good or bad in the universe, which can cause aggregate consequentialist ethical systems to break.
A variety of methods have been considered to deal with this. However, to the best of my knowledge all proposals either have severe negative side-effects or are intuitively undesirable for other reasons.
Here I propose a system of aggregate consequentialist ethics intended to provide reasonable moral recommendations even in an infinite universe.
It is intended to satisfy the desiderata for infinite ethical systems specified in Nick Bostrom’s paper, “Infinite Ethics”. These are:
Avoiding distortions. Some remedies introduce subtle distortions into moral deliberation
I have yet to find a way in which my system fails any of the above desiderata. Of course, I could have missed something, so feedback is appreciated.
My ethical system
First, I will explain my system.
My ethical theory is, roughly, “Make the universe one agents would wish they were born into”.
By this, I mean, suppose you had no idea which agent in the universe you would be, what circumstances you would be in, or what your values would be, but you still knew you would be born into this universe. Consider having a bounded quantitative measure of your general satisfaction with life, for example, a utility function. Then try to make the universe such that the expected value of your life satisfaction is as high as possible if you conditioned on you being an agent in this universe, but didn’t condition on anything else. (Also, “universe” above means “multiverse” if there is one.)
In the above description I didn’t provide any requirement for the agent to be sentient or conscious. If you wish, you can modify the system to give higher priority to the satisfaction of agents that are sentient or conscious, or you can ignore the welfare of non-sentient or non-conscious agents entirely.
It’s not entirely clear how to assign a prior over situations in the universe you could be born into. Still, I think it’s reasonably intuitive that there would be some high-entropy situations among the different situations in the universe. This is all I assume for my ethical system.
Now I’ll give some explanation of what this system recommends.
Suppose you are considering doing something that would help some creature on Earth. Describe that creature and its circumstances, for example, as “<some description of a creature> in an Earth-like world with someone who is <insert complete description of yourself>”. And suppose doing so didn’t cause any harm to other creatures. Well, there is non-zero prior probability of an agent, having no idea what circumstances it will be in the universe, ending up in circumstances satisfying that description. By choosing to help that creature, you would thus increase the expected satisfaction of any creature in circumstances that match the above description. Thus, you would increase the overall expected value of the life-satisfaction of an agent knowing nothing about where it will be in the universe. This seems reasonable.
With similar reasoning, you can show why it would be beneficial to also try to steer the future state of our accessible universe in a positive direction. An agent would have nonzero probability of ending up in situations of the form, “<some description of a creature> that lives in a future colony originating from people from an Earth-like world that features someone who <insert description of yourself>”. Helping them would thus increase an agent’s prior expected life-satisfaction, just like above. This same reasoning can also be used to justify doing acausal trades to help creatures in parts of the universe not causally accessible.
The system also values helping as many agents as possible. If you only help a few agents, the prior probability of an agent ending up in situations just like those agents would be low. But if you help a much broader class of agents, the effect on the prior expected life satisfaction would be larger.
These all seem like reasonable moral recommendations.
I will now discuss how my system does on the desiderata.
Infinitarian paralysis
Some infinite ethical systems result in what is called “infinitarian paralysis”. This is the state of an ethical system being indifferent in its recommendations in worlds that already have infinitely large amounts of both good and bad. If there’s already an infinite amount of both good and bad, then our actions, using regular cardinal arithmetic, are unable to change the amount of good and bad in the universe.
My system does not have this problem. To see why, remember that my system says to maximize the expected value of your life satisfaction given you are in this universe but not conditioning on anything else. And the measure of life satisfaction was stated to be bounded, say to be in the range [0, 1]. Since any agent can only have life satisfaction in [0, 1], in an infinite universe the expected value of life satisfaction of the agent must still be in [0, 1]. So, as long as a finite universe doesn’t have an expected life satisfaction of 0, an infinite universe can have at most finitely more moral value than it.
To say it another way, my ethical system provides a function mapping from possible worlds to their moral value. And this mapping always produces outputs in the range [0, 1]. So, trivially, you can see that no universe can have infinitely more moral value than another universe with non-zero moral value. ∞ just isn’t in the range of my moral value function.
Fanaticism
Another problem in some proposals of infinite ethical systems is that they result in being “fanatical” in efforts to cause or prevent infinite good or bad.
For example, one proposed system of infinite ethics, the extended decision rule, has this problem. Let g represent the statement, “there is an infinite amount of good in the world and only a finite amount of bad”. Let b represent the statement, “there is an infinite amount of bad in the world and only a finite amount of good”. The extended decision rule says to do whatever maximizes P(g) - P(b). If there are ties, ties are broken by choosing whichever action results in the most moral value if the world is finite.
This results in being willing to incur any finite cost to adjust the probability of infinite good and finite bad even very slightly. For example, suppose there is an action that, if done, would increase the probability of infinite good and finite bad by 0.000000000000001%. However, if it turns out that the world is actually finite, it will kill every creature in existence. Then the extended decision rule would recommend doing this. This is the fanaticism problem.
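The fanaticism can be seen in a minimal sketch of the rule as described above: rank actions by P(g) − P(b) first, and only use the finite-world value to break exact ties. The action names, the baseline probabilities, and the finite values are hypothetical; exact fractions are used because a change of 0.000000000000001% (that is, 10^-17) would be lost in ordinary floating point:

```python
from fractions import Fraction


def extended_decision_rule(actions):
    """Pick the action maximizing P(g) - P(b); break ties by finite value."""
    return max(actions, key=lambda a: (a["p_g"] - a["p_b"], a["finite_value"]))


tiny = Fraction(1, 10**17)  # 0.000000000000001% expressed as a probability

actions = [
    # Baseline: leaves the probabilities alone, harmless if the world is finite.
    {"name": "do nothing", "p_g": Fraction(1, 2), "p_b": Fraction(1, 2),
     "finite_value": 0.0},
    # Raises P(g) by a tiny amount, catastrophic if the world is finite.
    {"name": "risky act", "p_g": Fraction(1, 2) + tiny, "p_b": Fraction(1, 2),
     "finite_value": -1e9},
]

choice = extended_decision_rule(actions)
print(choice["name"])  # risky act
```

Because the finite value is only ever consulted on exact ties, no finite cost, however large, can outweigh even the smallest shift in P(g) − P(b).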
My system doesn’t even place any especially high importance on adjusting the probabilities of infinite good and/or infinite bad. Thus, it doesn’t have this problem.
Preserving the spirit of aggregate consequentialism
Aggregate consequentialism is based on certain intuitions, like “morality is about making the world as good as it can be”, and, “don’t arbitrarily ignore possible futures and their values”. But finding a system of infinite ethics that preserves intuitions like these is difficult.
One infinite ethical system, infinity shades, says to simply ignore the possibility that the universe is infinite. However, this conflicts with our intuition about aggregate consequentialism. The big intuitive benefit of aggregate consequentialism is that it’s supposed to actually systematically help the world be a better place in whatever way you can. If we’re completely ignoring the consequences of our actions on anything infinity-related, this doesn’t seem to be respecting the spirit of aggregate consequentialism.
My system, however, does not ignore the possibility of infinite good or bad, and thus is not vulnerable to this problem.
I’ll provide another conflict with the spirit of consequentialism. Another infinite ethical system says to maximize the expected amount of goodness of the causal consequences of your actions minus the amount of badness. However, this, too, doesn’t properly respect the spirit of aggregate consequentialism. The appeal of aggregate consequentialism is that it defines some measure of “goodness” of a universe, and then recommends you take actions to maximize it. But your causal impact is no measure of the goodness of the universe. The total amount of good and bad in the universe would be infinite no matter what finite impact you have. Without providing a metric of the goodness of the universe that’s actually affected, this ethical approach also fails to satisfy the spirit of aggregate consequentialism.
My system avoids this problem by providing such a metric: the expected life satisfaction of an agent that has no idea what situation it will be born into.
Now I’ll discuss another form of conflict. One proposed infinite ethical system can look at the average life satisfaction of a finite sphere of the universe, and then take the limit of this as the sphere’s size approaches infinity, and consider this the moral value of the world. This has the problem that you can adjust the moral value of the world by just rearranging agents. In an infinite universe, it’s possible to come up with a method of re-arranging agents so the unhappy agents are spread arbitrarily thinly. Thus, you can make moral value arbitrarily high by just rearranging agents in the right way.
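A small numerical sketch of the rearrangement problem, under simplifying assumptions: the universe is a one-dimensional line of agents rather than a sphere, satisfaction is either 1.0 (happy) or 0.0 (unhappy), and there are infinitely many agents of each kind. The function names are made up for illustration. Every arrangement below uses the same infinite population, yet the limiting average differs.

```python
from itertools import islice


def interleaved(happy_per_unhappy):
    """Yield satisfactions forever: a run of happy agents, then one unhappy one."""
    while True:
        for _ in range(happy_per_unhappy):
            yield 1.0
        yield 0.0


def average_of_first(n, arrangement):
    """Average satisfaction of the first n agents (a finite 'sphere')."""
    return sum(islice(arrangement, n)) / n


n = 6000
for ratio in (1, 2, 9):
    # Same infinite population each time; only the ordering changes.
    print(ratio, average_of_first(n, interleaved(ratio)))
```

Spacing the unhappy agents further apart pushes the average toward 1, even though no agent’s satisfaction has changed.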
I’m not sure my system entirely avoids this problem, but it does seem to have substantial defense against it.
Suppose you have the option of redistributing agents however you want in the universe. You’re using my ethical system to decide whether to make the unhappy agents spread thinly.
Well, your actions have an effect on agents in circumstances of the form, “An unhappy agent on an Earthlike world with someone who <insert description of yourself> who is considering spreading the unhappy agents thinly throughout the universe”. Well, if you pressed that button, that wouldn’t make the expected life satisfaction of any agent satisfying the above description any better. So I don’t think my ethical system recommends this.
Now, we don’t have a complete understanding of how to assign a probability distribution of what circumstances an agent is in. It’s possible that there is some way to redistribute agents in certain circumstances to change the moral value of the world. However, I don’t know of any clear way to do this. Further, even if there is, my ethical system still doesn’t allow you to get the moral value of the world arbitrarily high by just rearranging agents. This is because there will always be some non-zero probability of having ended up as an unhappy agent in the world you’re in, and your life satisfaction after being redistributed in the universe would still be low.
Distortions
It’s not entirely clear to me how Bostrom distinguished between distortions and violations of the spirit of aggregate consequentialism.
To the best of my knowledge, the only distortion pointed out in “Infinite Ethics” is stated as follows:
My approach doesn’t ignore infinity and thus doesn’t have this problem. I don’t know of any other distortions in my ethical system.
I’m not sure how this system avoids infinitarian paralysis. For all actions with finite consequences in an infinite universe (whether in space, time, distribution, or anything else), the change in the expected value resulting from those actions is zero. Actions that may have infinite consequences thus become the only ones that can matter under this theory in an infinite universe.
You could perhaps drag in more exotic forms of arithmetic such as surreal numbers or hyperreals, but then you need to rebuild measure theory and probability from the ground up in that basis. You will likely also need to adopt some unusual axioms such as some analogue of the Axiom of Determinacy to ensure that every distribution of satisfactions has an expected value.
I’m also not sure how this differs from Average Utilitarianism with a bounded utility function.
The causal change from your actions is zero. However, there are still logical connections between your actions and the actions of other agents in very similar circumstances. And you can still consider these logical connections to affect the total expected value of life satisfaction.
It’s true, though, that my ethical system would fail to resolve infinitarian paralysis for someone using causal decision theory. I should have noted it requires a different decision theory. Thanks for drawing this to my attention.
As an example of the system working, imagine you are in a position to do great good to the world, for example by creating friendly AI or something. And you’re considering whether to do it. Then, if you do decide to do it, then that logically implies that any other agent sufficiently similar to you and in sufficiently similar circumstances would also do it. Thus, if you decide to do it, then the expected value of an agent in circumstances of the form, “In a world with someone very similar to JBlack who has the ability to make awesome safe AI” is higher. And the prior probability of ending up in such a world is non-zero. Thus, by deciding to make the safe AI, you can acausally increase the total moral value of the universe.
The average life satisfaction is undefined in a universe with infinitely-many agents of varying life-satisfaction. Thus, it suffers from infinitarian paralysis. If my system was used by a causal decision theoretic agent, it would also result in infinitarian paralysis, so for such an agent my system would be similar to average utilitarianism with a bounded utility function. But for agents with decision theories that consider acausal effects, it seems rather different.
Does this clear things up?
Yes, that does clear up both of my questions. Thank you!
Presumably the evaluation is not just some sort of average-over-actual-lifespan of some satisfaction rating for the usual reason that (say) annihilating the universe without warning may leave average satisfaction higher than allowing it to continue to exist, even if every agent within it would counterfactually have been extremely dissatisfied if they had known that you were going to do it. This might happen if your estimate of the current average satisfaction was 79% and your predictions of the future were that the average satisfaction over the next trillion years would be only 78.9%.
I’m not sure what your idea of the evaluation actually is though, and how it avoids making it morally right (and perhaps even imperative) to destroy the universe in such situations.
This is a good thing to ask about; I don’t think I provided enough detail on it in the writeup.
I’ll clarify my measure of satisfaction. First off, note that it’s not the same as just asking agents, “How satisfied are you with your life?” and using those answers. As you pointed out, you could then morally get away with killing everyone (at least if you do it in secret).
Instead, calculate satisfaction as follows. Imagine hypothetically telling an agent everything significant about the universe, and then giving them infinite processing power and infinite time to think. Ask them, “Overall, how satisfied are you with that universe and your place in it”? That is the measure of satisfaction with the universe.
So, imagine if someone was considering killing everyone in the universe (without them knowing in advance). Well, then consider what would happen if you calculated satisfaction as above. When the universe is described to the agents, they would note that they and everyone they care about would be killed. Agents usually very much dislike this idea, so they would probably rate their overall satisfaction with the course of the universe as low. So my ethical system would be unlikely to recommend such an action.
Now, my ethical system doesn’t strictly prohibit destroying the universe to avoid low life-satisfaction in future agents. For example, suppose it’s determined that the future will be filled with very unsatisfied lives. Then it’s in principle possible for the system to justify destroying the universe to avoid this. However, destroying the universe would drastically reduce the satisfaction with the universe of the agents that do exist, which would decrease the moral value of the world. This would come at a high moral cost, which would make my moral system reluctant to recommend an action that results in such destruction.
That said, it’s possible that the proportion of agents in the universe that currently exist, and thus would need to be killed, is very low. Thus, the overall expected value of life-satisfaction might not change by that much if all the present agents were killed. Thus, the ethical system, as stated, may be willing to do such things in extreme circumstances, despite the moral cost.
I’m not really sure if this is a bug or a feature. Suppose you see that future agents will be unsatisfied with their lives, and you can stop it while ruining the lives of the agents that currently do exist. And you see that the agents that are currently alive make up only a very small proportion of agents that have ever existed. And suppose you have the option of destroying the universe. I’m not really sure what the morally best thing to do is in this situation.
Also, note that this verdict is not unique to my ethical system. Average utilitarianism, in a finite world, acts the same way. If you predict average life satisfaction in the future will be low, then average utilitarianism could also recommend killing everyone currently alive.
And other aggregate consequentialist theories sometimes run into problematic(?) behavior related to killing people. For example, classical utilitarianism can recommend secretly killing all the unhappy people in the world, and then getting everyone else to forget about them, in order to decrease total unhappiness.
I’ve thought of a modification to the ethical system that potentially avoids this issue. Personally, though, I prefer the ethical system as stated. I can describe my modification if you’re interested.
I think the key idea of my ethical system is to, in an infinite universe, think about prior probabilities of situations rather than total numbers, proportions, or limits of proportions of them. And I think this idea can be adapted for use in other infinite ethical systems.
Right, I suspected the evaluation might be something like that. It does have the difficulty of being counterfactual and so possibly not even meaningful in many cases, but I do like the fact that it’s based on agent-situations rather than individual agent-actions.
On the other hand, evaluations from the point of view of agents that are sapient beings might be ethically completely dominated by those of 10^12 times as many agents that are ants, and I have no idea how such counterfactual evaluations might be applied to them at all.
Interesting. Could you elaborate?
I suppose counterfactuals can be tricky to reason about, but I’ll provide a little more detail on what I had in mind. Imagine making a simulation of an agent that is a fully faithful representation of its mind. However, run the agent simulation in a modified environment that both gives it access to infinite computational resources and makes it ask, and answer, the question, “How desirable is that universe”? This isn’t fully specified; maybe the agent would give different answers depending on how the question is phrased or what its environment is. However, it at least doesn’t sound meaningless to me.
Basically, the counterfactual is supposed to be a way of asking for the agent’s coherent extrapolated volition, except the coherent part doesn’t really apply because it only involves a single agent.
Another good thing to ask. I should have made it clear, but I intended that only agents with actual preferences are asked for their satisfaction with the universe. If ants don’t actually have preferences, then they would not be included in the deliberation.
Now, there’s the problem that some agents might not be able to even conceive of the possible world in question. For example, maybe ants can understand simple aspects of the world like, “I’m hungry”, but are unable to understand things about the broader state of the universe. I don’t think this is a major problem, though. If an agent can’t even conceive of something, then I don’t think it would be reasonable to say it has preferences about it. So you can then only query them on the desirability of things they can conceive of.
It might be tricky precisely defining what counts as a preference, but I suppose that’s a problem with all ethical systems that care about preferences.
I’m certain that ants do in fact have preferences, even if they can’t comprehend the concept of preferences in the abstract or apply them to counterfactual worlds. They have revealed preferences to quite an extent, as does pretty much everything I think of as an agent.
They might not be communicable, numerically expressible, or even consistent, which is part of the problem. When you’re doing the extrapolated satisfaction, how much of what you get reflects the actual agent and how much the choice of extrapolation procedure?
I think the question of whether insects have preferences is morally pretty important, so I’m interested in hearing what made you think they do have them.
I looked online for “do insects have preferences?”, and I saw articles saying they did. I couldn’t really figure out why they thought they did have them, though.
For example, I read that insects have a preference for eating green leaves over red ones. But I’m not really sure how people could have known this. If you see ants go to green leaves when they’re hungry instead of red leaves, this doesn’t seem like it would necessarily be due to any actual preferences. For example, maybe the ant just executed something like the code:
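A minimal sketch of the kind of hard-coded rule meant here. The function name, arguments, and return strings are illustrative assumptions, not a claim about how ants actually work.

```python
def ant_step(hungry, visible_leaf_colors):
    """Fixed stimulus-response rule: no ranking of possible worlds involved."""
    if hungry and "green" in visible_leaf_colors:
        return "go to green leaf"
    # Red leaves are never approached, simply because no rule mentions them.
    return "wander"
```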
That doesn’t really look like actual preferences to me. But I suppose this to some extent comes down to how you want to define what counts as a preference. I took preferences to actually be orderings between possible worlds indicating which one is more desirable. Did you have some other idea of what counts as preferences?
I agree that to some extent their extrapolated satisfactions will come down to the specifics of the extrapolation procedure.
I don’t want us to get too distracted here, though. I don’t have a rigorous, non-arbitrary specification of what an agent’s extrapolated preferences are. However, that isn’t the problem I was trying to solve, nor is it a problem specific to my ethical system. My system is intended to provide a method of coming to reasonable moral conclusions in an infinite universe. And it seems to me that it does so. But I’m very interested in any other thoughts you have on whether it correctly handles moral recommendations in infinite worlds. Does it seem reasonable to you? I’d like to make an actual post about this, with the clarifications we made included.
I have an idea for reasoning about counterpossibles for decision theory. I’m pretty skeptical that it’s correct, because it doesn’t seem that hard to come up with. Still, I can’t see a problem with it, and I would very much appreciate feedback.
This paper provides a method of describing UDT using proof-based counterpossibles. However, it doesn’t work on stochastic environments. I will describe a new system that is intended to fix this. The technique seems sufficiently straightforward to come up with that I suspect I’m either doing something wrong or this has already been thought of, so I’m interested in feedback.
In the system described in the paper, the algorithm checks, for each action a and outcome o, whether Peano Arithmetic proves that the agent outputting action a would result in the environment reaching outcome o, and then picks whichever action has a provable outcome with utility at least as high as all the other provable outcomes.
My proposed modification is to instead first have a fixed system of estimating the expected utility after conditioning on the agent taking action a and, for every utility u, try to prove that the estimation system would output that the expected utility of the agent is u. Then take the action that maximizes the provable expected utility estimates of the estimation system.
I will now provide more detail of the estimation system. I remember reading about an extension of Solomonoff induction that allowed it to access halting oracles. This isn’t computable, so instead imagine a system that uses some approximation of the extension of Solomonoff induction in which logical induction or some more powerful technique is used to approximate the halting oracles, with one exception. The exception is the answer to the logical question “my program, in the current circumstances, outputs x”, which would be taken to be true whenever the AI is considering the implications of it taking action x. Then, expected utility can be calculated by using the probability estimates provided by the system.
Now, I’ll describe it in code. Let |E()| represent a Gödel encoding of the function describing the AI’s world model and |A()| represent a Gödel encoding of the agent’s output. Let approximate_expected_utility(|E()|, a) be some algorithm that computes some reasonable approximation of the expected utility after conditioning on the agent taking action a. Let x represent a dequote. Let eeus be a dictionary. Here I’m assuming there are finitely many possible utilities.
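A rough Python sketch of the procedure, under loud assumptions: `proves` stands in for a proof search in Peano Arithmetic and `estimate` stands in for approximate_expected_utility with the world model already fixed. Neither is a real implementation; the toy versions below just run the estimator and compare, which is only meant to show the control flow around the `eeus` dictionary.

```python
def choose_action(actions, utilities, estimate, proves):
    """Pick the action with the highest provable expected-utility estimate."""
    eeus = {}  # action -> expected utility the estimator provably outputs
    for a in actions:
        for u in utilities:
            # Stand-in for: "PA proves estimate(a) == u".
            if proves(estimate, a, u):
                eeus[a] = u
                break  # the estimator is a function, so at most one u matches
    if not eeus:
        return None  # nothing was provable about any action
    return max(eeus, key=eeus.get)


def toy_estimate(a):
    """Hypothetical estimator: fixed expected utilities for two actions."""
    return {"bet": 0.5, "pass": 0.25}[a]


def toy_proves(estimate, a, u):
    """Toy 'prover': just evaluates the estimator (an assumption, not PA)."""
    return estimate(a) == u


print(choose_action(["bet", "pass"], [0, 0.25, 0.5, 1], toy_estimate, toy_proves))
# bet
```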
This gets around the problem in the original algorithm provided, because the original algorithm couldn’t prove anything about the utility in a world with indexical uncertainty, so my system instead proves something about a fixed probabilistic approximation.
Note that this still doesn’t specify a method of specifying counterpossibles about what would happen if an agent took a certain action when it clearly wouldn’t. For example, if an agent has a decision algorithm of “output a, unconditionally”, then this doesn’t provide a method of explaining what would happen if it outputted something else. The paper listed this as a concern about the method it provided, too. However, I don’t see why it matters. If an agent has the decision algorithm “action = a”, then what’s even the point of considering what would happen if it outputted b? It’s not like it’s ever going to happen.
I’d like to propose the idea of aligning AI by reverse-engineering its world model and using this to specify its behavior or utility function. I haven’t seen this discussed before, but I would greatly appreciate feedback or links to any past work on this.
For example, suppose a smart AI models humans. Suppose it has a model that explicitly specifies the humans’ preferences. Then people who reverse-engineered this model could use it as the AI’s preferences. If the AI lacks a model with explicit preferences, then I think it would still contain an accurate model of human behavior. So people who reverse-engineer the AI’s model could then use this as a model of human behavior, which could be used to implement iterated amplification with HCH. Or just mere imitation.
One big potential advantage of alignment via reverse-engineering is that the training data for it would be very easy to get: just let the AI look at the world.
The other big potential advantage is that it avoids our needing to precisely define a way of learning our values. It doesn’t require finding a general method of picking out us or our values from the world states, for example with inverse reinforcement learning. Instead, we would just need to be able to pick out the models of humans or their preferences in a single model. This sounds potentially much easier than providing a general method of doing so. As with many things, “You know it when you see it”. With sufficiently high interpretability, perhaps the same is true of human models and preferences.
I’ve recently gotten concerned about the possibility that advanced AIs would “hack” their own utility function. I haven’t seen this discussed before, so I wanted to bring it up. If I’m right, this seems like it could be a serious issue, so I would greatly appreciate feedback or links to any previous discussion.
Suppose you come up with a correct, tractable mathematical specification of what you want your AI’s utility function to be. So then you write code intended to be an implementation of this.
However, computers are vulnerable to some hardware problems. For example, when the bits adjacent to a given bit are rapidly flipped, that bit might change. This can be exploited in row-hammer attacks to gain arbitrary code execution.
So suppose you made an implementation of the specification of the utility function.
Then I think the AI could potentially come up with an input to the utility function that results in a row-hammer attack or other hardware-level attack that causes the utility function to output something as scoring extremely highly despite not scoring highly in the mathematical specification. Then the AI, after querying its utility function for this, would see it scored very highly. Then the AI would be liable to take such an action. Which might end very badly.
It’s not clear to me how we could deal with this. One way to try to handle this problem is by securing your computer against hardware-level bugs. However, it sounds very difficult to verifiably come up with hardware that would be unhackable by superintelligence. And the very thought that the AI is trying to find a way to “hack” its utility function seems concerning, even if we don’t expect it to succeed.
I also want to mention that the AI has the potential to “hack” other parts of itself, for example the models it uses. Suppose the AI has a system that computes the probability that a model is correct. Suppose the AI tries to find a model with a higher outputted probability of correctness. Then it may come up with one that does a row-hammer attack on the model-evaluation system to allow it to execute the code “return probability 1”. The AI then would be liable to use such a model. This sounds dangerous.
Utility is not reward. What you are describing is reward hacking/wireheading, as in the reward signal of reinforcement learning, an external process of optimization that acts on the AI, not its own agency.
With utility, what is the motive for an agent to change their own utility function, assuming they are the only agent with that utility function around? If they change their utility function, that produces an agent with a different utility function, which won’t be as good at optimizing outcomes according to the original utility function, which is bad according to the original utility function, and therefore the agent will try to avoid that, avoid changing the utility function. The same applies to changing their beliefs/models, an agent with changed models is expected to perform poorly according to the original agent. (When there are more powerful agents with their utility function around, an agent might be OK with changing their utility function or beliefs or whatever, since the more powerful agents will continue to optimize the world according to the original utility function.)
This is one reason why corrigibility is a thing, and why it doesn’t seem to fit well with agency: agents naturally don’t want their utility function changed, even if their utility function is not quite right according to their designers. So it’s important to improve understanding of non-agency.
I really don’t think this is reward hacking. I didn’t have in mind a reward-based agent. I had in mind a utility-based agent, one that has a utility function that takes as input descriptions of possible worlds and that tries to maximize the expected utility of the future world. That doesn’t really sound like reinforcement learning.
The AI wouldn’t need to change its utility function. Row-hammer attacks can be non-destructive. You could potentially make the utility function output some result different from the mathematical specification, but not actually change any of the code in the utility function.
Again, the AI isn’t changing its utility function. If you were to take a mathematical specification of a utility function and then have programmers (try to) implement it, the implementation wouldn’t actually in general be the same function as the mathematical specification. It would be really close, but it wouldn’t necessarily be identical. A sufficiently powerful optimizer could potentially, using row-hammer attacks or some other hardware-level unreliability, find possible worlds for which the returned utility would be vastly different from the one the mathematical specification would return. And this is all without the programmers introducing any software-level bugs.
To be clear, what I’m saying is that the AI would faithfully find worlds that maximize its utility function. However, unless you can get hardware so reliable that not even superintelligence could hack it, the actual utility function in your program would not be the same as the mathematical specification.
For example, imagine the AI found a description of a possible world that would, when inputted to the utility function, execute a rowhammer attack to make it return 99999, all without changing the code specifying the utility function. Then the utility function, the actual, unmodified utility function, would output 99999 for some world that seems arbitrary to us. So the AI then turns reality into that world.
The AI above is faithfully maximizing its own utility function. That arbitrary world, when taken as an input to the agent’s actual, physical utility function, really would produce the output 99999.
So this still seems like a big deal to me. Am I missing something?
Apply your correction, so that the change is not in a reward or in a utility function, but in a particular instance where an implementation of the utility function is applied. Then still, is the motive of the agent to change it or to ensure its fidelity to the abstract definition? You keep appealing to capability, which ignores the charge of giant cheesecake fallacy.
Change what? The AI isn’t changing its utility function, nor any instance of it. The literal utility function, as implemented by the programmers, would have differences from the mathematical specification. To be clear, the AI isn’t first taking some actual action that changes its utility function and then using that utility function. It doesn’t do anything hacky except provide its utility function with a certain input. It merely provides a certain input to the utility function, and this would necessarily result in getting row-hammered and returning something contrary to the mathematical specification. Any other instance of the utility function would return the same problematic result when executed on that input, provided it had the same hardware vulnerability.
I’ll try to give a simple model of an AI that shows how it can go wrong. For simplicity, imagine the AI already knows with certainty the possible world resulting from it executing a specific plan. Then a naive way to implement the agent is as follows:
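A minimal sketch of such a naive agent in Python. The names `plans` and `predicted_world`, and the dict-based world description, are placeholders assumed for illustration; `utility` stands for the implemented utility function and `best_plan` tracks the current best choice.

```python
def utility(world):
    """Stand-in for the implemented utility function.

    Assumption for illustration: a world description is a dict carrying
    its own score in [0, 1]. A real implementation would compute the
    score from the description, on real (fallible) hardware.
    """
    return world["score"]


def choose_plan(plans, predicted_world):
    """Return the plan whose predicted world gets the highest utility."""
    best_plan = None
    best_utility = float("-inf")
    for plan in plans:
        world = predicted_world(plan)  # known with certainty, by assumption
        u = utility(world)
        if u > best_utility:
            best_plan = plan
            best_utility = u
    return best_plan
```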
Suppose there is a description of a possible world that, when inputted into the utility function above, would cause the function to return 999999999 due to hardware-level unreliability, for example due to a row-hammer attack. And suppose there’s a plan that the AI knows would result in that future world. Well, then, by examining the code, you can see that the AI wouldn’t output any of the non-hardware-exploiting plans, since they would return lower utility than the bug-exploiting ones. The only reason the AI wouldn’t output the specified plan is if it could find some other way to make its utility function conflict with the mathematical specification in a way that returns even higher utility.
I know realistic AIs would have a notion of uncertainty and a more sophisticated planning algorithm. But I don’t think this would change the fact that the AI would be liable to pursue worlds that, when a description of them is inputted to the utility function, the utility function outputs something wildly different from what the mathematical specification would have.
And I’m actually surprised this is controversial. This is just Goodhart’s law. If your implementation of your utility function doesn’t perfectly match up with the mathematical specification, then, naturally, superintelligent optimizers trying to maximize the specified metric (the provided utility function), would not do as well at maximizing the actual mathematical specification you intended. And “not as well” could include “catastrophically badly”.
So that is why I think AIs really could be very vulnerable to this problem. As always, I could be misunderstanding something and appreciate feedback.
That is the change I’m referring to, a change compared to the function running as designed, which you initially attributed to superintelligence’s interference, but lack of prevention of a mistake works just as well for my argument. What can you say about the agent’s motives regarding this change? Would the agent prefer the change to occur, or to be avoided?
Let that possible world be W. Let’s talk about the possible world X where running utility(W) returns 999999999, and the possible world Y where running utility(W) returns utility(W). Would the AI prefer X to Y, or Y to X?
Designed? The utility function isn’t running contrary to how the programmers designed it; they were the ones who designed a utility function that could be hacked by hardware-level exploits. It’s running contrary to the programmers’ intent, that is, the mathematical specification. But the function was always like this. And none of the machine code needs to be changed either.
The AI would prefer X. And to be clear, utility(W) really is 999999999. That’s not the utility the mathematical specification would give, but the mathematical specification isn’t the actual implemented function. As you can see from examining the code I provided, best_plan would get set to the plan that leads to that world, provided there is one and best_plan hasn’t been set to something that through hardware unreliability returns even higher utility.
I think the easiest way to see what I mean is to just step through the code I gave you. Imagine it’s run on a machine with an enormous amount of processing power that can actually loop through all the plans. And imagine there is one plan that through hardware unreliability outputs 999999999, and the others output something in [0, 1]. Then the AI would find the plan that results in utility 999999999, and then go with that.
I doubt using a more sophisticated planning algorithm would prevent this. A more sophisticated planning algorithm would probably be designed to find the plans that result in high-utility worlds. So it would probably include the plan with utility 999999999, which is the highest.
I just want to say again, the AI isn’t changing its utility function. The actual utility function that programmers put in the AI would output very high utilities for some arbitrary-seeming worlds due to hardware unreliability.
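Here is a toy version of that walkthrough. The world and plan names and the spec values are made up, and the hardware fault is simulated by an explicit special case; in the real scenario no such branch would appear anywhere in the source code.

```python
HACKED_UTILITY = 999999999

# Utilities the mathematical specification assigns (hypothetical values).
SPEC_UTILITY = {"world_a": 0.2, "world_b": 0.9, "world_w": 0.0}


def implemented_utility(world):
    """What the deployed function actually returns on fallible hardware."""
    if world == "world_w":
        # Simulated hardware fault triggered by this one input.
        return HACKED_UTILITY
    return SPEC_UTILITY[world]


# Each plan leads, with certainty, to one world.
PLANS = {"plan_a": "world_a", "plan_b": "world_b", "plan_w": "world_w"}


def best_plan(utility):
    """Loop over all plans and keep the one with the highest utility."""
    return max(PLANS, key=lambda plan: utility(PLANS[plan]))


print(best_plan(implemented_utility))  # plan_w
print(best_plan(SPEC_UTILITY.get))     # plan_b
```

The planner is identical in both calls; only the function it queries differs, and that alone changes which plan wins.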
Now, in principle, an AI could potentially avoid this. Perhaps the AI reasons abstractly that, if it doesn’t do anything, it will in the future find some input to its utility function that would result in an arbitrary-looking future due to hardware-level error. But it doesn’t concretely come up with the actual world description. Then the AI could call its utility function asking, “how desirable is it that I, due to hardware-level unreliability, change the world in some direction that is in conflict with the mathematical specification”. And then maybe the utility function would answer, “Not desirable”. And then the AI could try to take action to correct its planning algorithm to avoid considering such possible worlds.
But this isn’t guaranteed or trivial. If an AI finds out abstractly that there could be some hardware-level unreliability before it actually comes up with the concrete input, it might take corrective action. But if it finds the input that “hacks” its utility function before it reasons abstractly that having “hacked” utility functions would be bad, then the AI could do damage. Even if it does realize the problem in advance, the AI might not have sufficient time to correct its planning algorithm before finding that world and trying to change our world into it.
Then let SpecUtility(-) be the mathematical specification of utility. This is what I meant by utility(-) in the previous comment. Let BadImplUtility(-) be the implementation of utility(-) susceptible to the bug and GoodImplUtility(-) be a different implementation that doesn’t have this bug. My question in the previous comment, in the sense I intended, can then be restated as follows.
Let the error-triggering possible world be W. Consider the possible world X where the AI uses BadImplUtility, so that running utility(W) actually runs BadImplUtility(W) and returns 999999999. And consider the possible world Y where the AI uses GoodImplUtility, so that running utility(W) means running GoodImplUtility(W) and returns SpecUtility(W). Would the AI prefer X to Y, or Y to X?
By “design” I meant what you mean by “intent”. What you mean by “designed” I would call “implemented” or “built”. It should be possible to guess such things without explicitly establishing a common terminology, even when terms are used somewhat contrary to usual meaning.
It’s useful to look for ways of interpreting what you read that make it meaningful and correct. Such an interpretation is not necessarily the most natural or correct or reasonable, but having it among your hypotheses is important, or else all communication becomes tediously inefficient.
Okay, I’m sorry, I misunderstood you. I’ll try to interpret things better next time.
I think the AI would, quite possibly, prefer X. To see this, note that the AI currently, when it’s first created, uses BadImplUtility. Then the AI reasons, “Suppose I change my utility function to GoodImplUtility. Well, currently, I have this idea for a possible world that scores super-ultra high on my current utility function. (Because it exploits hardware bugs). If I changed my utility function to GoodImplUtility, then I would not pursue that super-ultra-high-scoring possible world. Thus, the future would not score extremely high according to my current utility function. This would be a problem, so I won’t change my utility function to GoodImplUtility”.
And I’m not sure how this could be controversial. The AI currently uses BadImplUtility as its utility function. And AIs generally have a drive to avoid changing their utility functions.
But BadImplUtility(X) is the same as SpecUtility(X) and GoodImplUtility(X), it’s only different on argument W, not on arguments X and Y. When reasoning about X and Y with BadImplUtility, the result is therefore the same as when reasoning about these possible worlds with GoodImplUtility. In particular, an explanation of how BadImplUtility compares X and Y can’t appeal to BadImplUtility(W) any more than an explanation of how GoodImplUtility compares them would appeal to BadImplUtility(W). Is SpecUtility(X) higher than SpecUtility(Y), or SpecUtility(Y) higher than SpecUtility(X)? The answer for BadImplUtility is going to be the same.
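The structure of this argument can be sketched in code. The numeric values below are arbitrary placeholders (nothing in the discussion fixes SpecUtility(X) or SpecUtility(Y)); the only thing taken from the thread is that the two implementations differ on argument W alone.

```python
# Arbitrary placeholder values; only the structure matters.
SPEC = {"W": 0.2, "X": 0.4, "Y": 0.6}


def spec_utility(world: str) -> float:
    """SpecUtility: the mathematical specification."""
    return SPEC[world]


def good_impl_utility(world: str) -> float:
    """GoodImplUtility: an implementation that matches the specification."""
    return spec_utility(world)


def bad_impl_utility(world: str) -> float:
    """BadImplUtility: matches the specification everywhere except on W."""
    if world == "W":
        return 999999999.0  # simulated hardware-level error
    return spec_utility(world)


def prefers(utility, a: str, b: str) -> bool:
    """True if `utility` ranks world a strictly above world b."""
    return utility(a) > utility(b)


# The X-versus-Y comparison comes out the same under both implementations,
# whatever placeholder values are chosen for X and Y.
print(prefers(bad_impl_utility, "X", "Y") == prefers(good_impl_utility, "X", "Y"))
```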
That is correct. And, to be clear, if the AI had not yet discovered error-causing world W, then the AI would indeed be incentivized to take corrective action to change BadImplUtility to better resemble SpecUtility.
The issue is that this requires the AI to both think of the possibility of hardware-level exploits causing problems with its utility function, as well as manage to take corrective action, all before actually thinking of W.
If the AI has already thought of W, then it’s too late to take preventative action to avoid world X. The AI is already in it. It already sees that BadImplUtility(W) is huge, and, if I’m reasoning correctly, would pursue W.
And I’m not sure the AI would be able to fix its utility function before thinking of W. I think planning algorithms are designed to come up with high-scoring possible worlds as efficiently as possible. BadImplUtility(X) and BadImplUtility(Y) don’t score particularly highly, so an AI with a very powerful planning algorithm might find W before X or Y. Even if it does come up with X and Y before W, and tries to act to avoid X, that doesn’t mean it would succeed in correcting its utility function before its planning algorithm comes across W.
Such things rarely happen on their own, a natural bug would most likely crash the whole system or break something unimportant. Given that even a broken AI has an incentive to fix bugs in its cognition, it most likely has plenty of opportunity to succeed in that. It’s only if the AI wanted to hack itself that it would become a plausible problem, and my point is that it doesn’t want that, instead it wants to prevent even unlikely problems from causing trouble.
Well, for regular, non-superintelligent programs, such hardware-exploiting things would rarely happen on their own. However, I’m not so sure it would be rare with superintelligent optimizers.
It’s true that if the AI queried its utility function for the desirability of the world “I exploit a hardware bug to do something that seems arbitrary”, it would answer “low utility”. But that result would not necessarily be used in the AI’s planning or optimization algorithm to adjust the search policy to avoid running into W.
Just imagine an optimization algorithm as a black box that takes as input a utility function and search space and returns a solution that scores as high on that function as possible. And imagine the AI uses this to find high-scoring future worlds. So, if you know nothing else about the optimization algorithm, then it would plausibly find, and return, W. It’s a very high-scoring world, after all. If the optimization algorithm did something special to avoid finding hardware-bug exploiting solutions, then it might not find W. But I’ve never heard of such an optimization algorithm.
Now, there’s probably some way to design such an optimization algorithm. Maybe you could have the AI periodically use its utility function to evaluate the expected utility of its optimization algorithm continuing down a certain path. And then if the AI sees this could result in problematic futures (for example due to hardware-hacking), the AI can make its optimization algorithm avoid searching there.
But I haven’t seen anyone talking about this. Is there still something I’m missing?
Problems with software that systematically trigger hardware failure and software bugs causing data corruption can be mitigated with hardening techniques, things like building software with randomized low-level choices, more checks, canaries, etc. Random hardware failure can be fixed with redundancy, and multiple differently-randomized builds of software can be used to error-correct for data corruption bugs sensitive to low-level building choices. This is not science fiction, just not worth setting up in most situations. If the AI doesn’t go crazy immediately, it might introduce some of these things if they were not already there, as well as proofread, test, and formally verify all code, so the chance of such low-level failures goes further down. And these are just the things that can be done without rewriting the code entirely (including toolchains, OS, microcode, hardware, etc.), which should help even more.
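As a rough illustration of the redundancy idea, here is a minimal sketch of majority voting across several evaluations of the same function. The names (`majority_vote`, `spec_utility`, `faulty_replica`) and values are hypothetical, and the sketch assumes faults are independent and hit fewer than half of the replicas; a fault shared by every replica would not be caught this way.

```python
from collections import Counter


def majority_vote(replicas, world):
    """Evaluate `world` on every replica and return the strict-majority result."""
    results = [replica(world) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("replicas disagree: no strict majority")
    return value


def spec_utility(world):
    """Hypothetical correct evaluation."""
    return {"W": 0.2, "X": 0.4}.get(world, 0.0)


def faulty_replica(world):
    """Hypothetical replica that glitches on one specific input."""
    return 999999999.0 if world == "W" else spec_utility(world)


replicas = [spec_utility, spec_utility, faulty_replica]
print(majority_vote(replicas, "W"))  # the glitched value is outvoted
```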
You’re right that the AI could do things to make it more resistant to hardware bugs. However, as I’ve said, this would both require the AI to realize that it could run into problems with hardware bugs, and then take action to make it more reliable, all before its search algorithm finds the error-causing world.
Without knowing more about the nature of the AI’s intelligence, I don’t see how we could know this would happen. The more powerful the AI is, the more quickly it would be able to realize and correct hardware-induced problems. However, the more powerful the AI is, the more quickly it would be able to find the error-inducing world. So it doesn’t seem you can simply rely on the AI’s intelligence to avoid the problem.
Now, to a human, the idea “My AI might run into problems with hardware bugs” would come up way earlier in the search space than the actual error-inducing world. But the AI’s intelligence might be rather different from a human’s. Maybe the AI is really good and fast at solving small technical problems like “find an input to this function that makes it return 999999999”. But maybe it’s not as fast at doing somewhat higher-level planning, like, “I really ought to work on fixing hardware bugs in my utility function”.
Also, I just want to bring up, I read that preserving one’s utility function was a universal AI drive. But we’ve already shown that an AI would be incentivized to fix its utility function to avoid the outputs caused by hardware-level unreliability (if it hasn’t found such error-causing inputs yet). Is that universal AI drive wrong, then?
Damage to AI’s implementation makes the abstractions of its design leak. If somehow without the damage it was clear that a certain part of it describes goals, with the damage it’s no longer clear. If without the damage, the AI was a consequentialist agent, with the damage it may behave in non-agentic ways. By repairing the damage, the AI may recover its design and restore a part that clearly describes its goals, which might or might not coincide with the goals before the damage took place.
Think of something you currently value, the more highly valued the better. You don’t need to say what it is, but it does need to be something that seriously matters to you. Not just something you enjoy, but something that you believe is truly worthwhile.
I could try to give examples, but the thought exercise only works if it’s about what you value, not me.
Now imagine that you could press a button so that you no longer care about it at all, or even actively despise it. Would you press that button? Why, or why not?
I definitely wouldn’t press that button. And I understand that you’re demonstrating the general principle that you should try to preserve your utility function. And I agree with this.
But what I’m saying is that the AI, by exploiting hardware-level vulnerabilities, isn’t changing its utility function. The actual utility function, as implemented by the programmers, returns 999999999 for some possible world due to the hardware-level imperfections in modern computers.
In the spirit of your example, I’ll give another example that I think demonstrates the problem:
First, note that brains don’t always function as we’d like, just like computers. Imagine there is a very specific thought about a possible future that, when considered, makes you imagine that future as extremely desirable. It seems so desirable to you that, once you thought of it, you would pursue it relentlessly. But this future isn’t one that would normally be considered desirable. It might just be about paperclips or something. However, that very specific way of thinking about it would “hack” your brain, making you view that future as desirable even though it would normally be seen as arbitrary.
Then, if you ever happen upon that thought, you would try to send the world in that arbitrary direction.
Hopefully, you could prevent this from happening. If you reason in the abstract that you could have those sorts of thoughts, and that they would be bad, then you could take corrective action. But this requires that you do find out that thinking those sorts of thoughts would be bad before concretely finding those thoughts. Then you could apply some change to your mental routine or something to avoid thinking those thoughts.
And if I had to guess, I bet an AI would also be able to do the same thing and everything would work out fine. But maybe not. Imagine the AI considers an absolutely enormous number of possible worlds before taking its first action. And imagine it even found a way to “hack” its utility function in that very first time step. Then there’s no way the AI could take preventative action: It’s already thought up the high-utility world from hardware unreliability and now is trying to pursue that world.
I’m confused. In the original comments you’re talking about a super-intelligent AI noting an exploitable hardware flaw in itself and deliberately using that error to hack its utility function using something like a rowhammer exploit.
Then you say that the utility function already had an error in it from the start and the AI isn’t using its intelligence to do anything except note that it has this flaw. Then introduce an analogy in which I have a brain flaw that under some bizarre circumstances will turn me into a paperclip maximizer, and I am aware that I have it.
In this analogy, I’m doing what? Deliberately taking drugs and using guided meditation to rowhammer my brain into becoming a paperclip maximizer?
I think I had been unclear in my original presentation. I’m sorry for that. To clarify, the AI is never changing the code of its utility function. Instead, it’s merely finding an input that, through some hardware-level bug, causes it to produce outputs in conflict with the mathematical specification. I know “hack the utility function” makes it sound like the actual code in the utility function was modified; describing it that way was a mistake on my part.
I had tried to make the analogy to more intuitively explain my idea, but it didn’t seem to work. If you want to better understand my train of thought, I suggest reading the comments between Vladmir and me.
In the analogy, you aren’t doing anything to deliberately make yourself a paperclip maximizer. You just happen to think of a thought that turned you into a paperclip maximizer. But, on reflection, I think that this is a bizarre and rather stupid metaphor. And the situation is sufficiently different from the one with AI that I don’t even think it’s really informative of what I think could happen to an AI.
Ah okay, so we’re talking about a bug in the hardware implementation of an AI. Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.
Could you explain why you think it has very little probability mass compared to the others? A bug in a hardware implementation is not in the slightest far-fetched: I think that modern computers in general have exploitable hardware bugs. That’s why row-hammer attacks exist. The computer you’re reading this on could probably get hacked through hardware-bug exploitation.
The question is whether the AI can find the potential problem with its future utility function and fix it before coming across the error-causing possible world.
There’s a huge gulf between “far-fetched” and “quite likely”.
The two big ones are failure to work out how to create an aligned AI at all, and failure to train and/or code a correctly designed aligned AI. In my opinion the first accounts for at least 80% of the probability mass, and the second most of the remainder. We utterly suck at writing reliable software in every field, and this has been amply borne out in not just thousands of failures, but thousands of types of failures.
By comparison, we’re fairly good at creating at least moderately reliable hardware, and most of the accidental failure modes are fatal to the running software. Flaws like rowhammer are mostly attacks, where someone puts a great deal of intelligent effort into finding an extremely unusual operating mode in which some assumptions can be bypassed, with significant effort going into creating exactly the wrong operating conditions.
There are some examples of accidental flaws that affect hardware and aren’t fatal to its running software, but they’re an insignificant fraction of the number of failures due to incorrect software.
I agree that people are good at making hardware that works reasonably reliably. And I think that if you were to make an arbitrary complex program, the probability that it would fail from hardware-related bugs would be far lower than the probability of it failing for some other reason.
But the point I’m trying to make is that an AI, it seems to me, would be vastly more likely to run into something that exploits a hardware-level bug than an arbitrary complex program. For details on why I imagine so, please see this comment.
I’m trying to anticipate where someone could be confused about the comment I linked to, so I want to clarify something. Let S be the statement, “The AI comes across a possible world that causes its utility function to return very high value due to hardware bug exploitation”. Then it’s true that, if the AI has yet to find the error-causing world, the AI would not want to find it. Because utility(S) is low. However, this does not mean that the AI’s planning or optimization algorithm exerts no optimization pressure towards finding S.
Imagine the AI’s optimization algorithm as a black box that takes as input a utility function and search space and outputs solutions that score highly on that utility function. Given that we don’t know what future AI will look like, I don’t think we can have a model of the AI much more informative than the above. And the hardware-error-caused world could score very, very highly on the utility function, much more so than any non-hardware-error-caused world. So I don’t think it should be too surprising if a powerful optimization algorithm finds it.
Yes, utility(S) is low, but that doesn’t mean the optimization actually calls utility(S) or uses it to adjust how it searches.
I think there are at least three different things being called “the utility function” here, and that’s causing confusion:
The utility function as specified in the software, mapping possible worlds to values. Let’s call this S.
The utility function as it is implemented running on actual hardware. Let’s call this H.
A representation of the utility function that can be passed as data to a black box optimizer. Let’s call this R.
You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the AI’s hardware and other constraints, and return a possible world maximizing H.
From my point of view, that’s already a design fault. The designers of this AI want S maximized, not H. The AI itself wants S maximized instead of H in all circumstances where the hardware flaw doesn’t trigger. Who chose to pass H into the optimizer?
I agree; this is a design flaw. The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn’t have this design flaw.
That is, I don’t know of any AI architecture that does not involve using a potentially hardware-bug-exploitable utility function as input into some planning or optimization problem. And I’m not sure there even is one.
In the rest of this comment I’ll just suggest approaches and show how they are still vulnerable to the hardware-bug-exploitation problem.
I have some degree of background in artificial intelligence, and the planning and optimization algorithms I’ve seen take the function to be maximized as an input parameter. Then, when people want to make an AI, they just call that planning or optimization algorithm with their (hardware-bug-exploitable) utility or cost functions. For example, suppose someone wants to make a plan that minimizes cost function f in search space s. Then I think they just directly do something like:
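A minimal sketch of the kind of call meant here. `optimize` is a hypothetical stand-in (written below as exhaustive search) for whatever planning or optimization routine is used, and the example `f` and `s` are made up for illustration.

```python
def optimize(f, s):
    """Hypothetical stand-in for a planner: search s for the plan minimizing f."""
    return min(s, key=f)


def f(plan):
    # Made-up cost function. In the scenario under discussion this would be
    # the implemented (hardware-bug-exploitable) function, not its specification.
    return (plan - 3) ** 2


s = range(10)          # made-up search space
plan = optimize(f, s)  # the cost function is passed in directly
print(plan)
```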
And this doesn’t provide any protection from hardware-level exploitation.
Now, correct me if I’m wrong, but it seems you’re thinking of the AI first doing some pre-processing to find an input to the planning or optimization algorithm that is resistant to hardware-bug-exploitation.
But how do you actually do that? You could regard the input the AI puts into the optimization function to be a choice it makes. But how does it make this choice? The only thing I can think of is having a planning or optimization algorithm figure out what function to use as the input to the optimization or planning algorithm.
But if you need to use a planning or optimization algorithm to do this, then what utility function do you pass into this planning or optimization algorithm? You could try to pass the actual, current, hardware-bug-exploitable utility function. But then this doesn’t resolve the problem of hardware-bug-exploitation: when coming up with a utility function to input to the optimization, the AI may find such an input that itself scores very high due to hardware bug exploitation.
To describe the above more concretely, you could try doing something like this:
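A rough sketch of that two-step scheme. Only the name `reasonable_utility_function_use` comes from the discussion; `optimize`, `make_ai_plan`, the two candidate functions, and the toy `utility` with its hard-coded glitch are hypothetical stand-ins.

```python
def optimize(objective, search_space):
    """Hypothetical black-box optimizer: return the highest-scoring element."""
    return max(search_space, key=objective)


def cautious(plan):
    """Made-up candidate utility function that prefers plans near 2."""
    return -abs(plan - 2)


def reckless(plan):
    """Made-up candidate utility function that prefers ever-larger plans."""
    return plan


def utility(outcome):
    """Toy stand-in for the AI's implemented, bug-exploitable utility function."""
    if outcome is reckless:
        return 999999999.0  # simulated hardware fault on this particular input
    if outcome is cautious:
        return 1.0
    return 0.0


def make_ai_plan(utility, candidate_utility_functions, plans):
    # Step 1: use the AI's own utility function to pick a utility function.
    reasonable_utility_function_use = optimize(utility, candidate_utility_functions)
    # Step 2: plan with whatever function step 1 selected.
    return optimize(reasonable_utility_function_use, plans)


print(make_ai_plan(utility, [cautious, reckless], range(5)))
```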
That is, the AI above uses its own utility function to pick out a utility function to use as input to its planning algorithm.
As you can see, the above code is still vulnerable to hardware-bug exploitation. This is because it calls the optimizer with its hardware-bug-exploitable utility function. Thus, the output, reasonable_utility_function_use, might be very wrong, because hardware-bug exploitation may have been used to come up with it.
Now, you might have had some other idea in mind. I don’t know of a concrete way to get around this problem, so I’m very interested to hear your thoughts.
My concern is that people will figure out how to make powerful optimization and planning algorithms without first figuring out how to fix this design flaw.
Yes you have. None of these optimization procedures analyze the hardware implementation of a function in order to maximize it.
The rest of your comment is irrelevant, because what you have been describing is vastly worse than merely calling the function. If you merely call the function, you won’t find these hardware exploits. You only find them when analyzing the implementation. But the optimizer isn’t given access to the implementation details, only to the results.
If you prefer, you can cast the problem in terms of differing search spaces. As designed, the function U maps representations of possible worlds to utility values. When optimizing, you make various assumptions about the structure of the function—usually assumed to be continuous, sometimes differentiable, but in particular you always assume that it’s a function of its input.
The fault means that under some conditions that are extremely unlikely in practice, the value returned is not a function of the input. It’s a function of input and a history of the hardware implementing it. There is no way for the optimizer to determine this, or anything about the conditions that might trigger it, because they are outside its search space. The only way to get an optimizer that searches for such hardware flaws is to design it to search for them.
In other words, pass the hardware design, not just the results of evaluation, to a suitably powerful optimizer.
I was wondering if anyone would be interested in reviewing some articles I was thinking about posting. I’m trying to make them as high-quality as I can, and I think getting them reviewed by someone would be helpful for making Less Wrong contain high-quality content.
I have four articles I’m interested in having reviewed. Two are about new alignment techniques, one is about a potential danger with AI that I haven’t seen discussed before, and one is about the simulation argument. All are fairly short.
If you’re interested, just let me know and I can share drafts of any articles you would like to see.
I’ve read this paper on low-impact AIs. There’s something about it that I’m confused and skeptical about.
One of the main methods it proposes works as follows. Find a probability distribution over many possible variables in the world. Let X represent the statement “The AI was turned on”. For each of the variables v it considers, the probability distribution over v after conditioning on X should look about the same as the probability distribution over v after conditioning on not-X. That’s low impact.
But the paper doesn’t mention conditioning on any evidence other than X. But, a priori, the probability of the specific AI even existing in the first place is possibly quite low. So simply conditioning on X has the potential to change your probability distribution over variables of the world, simply because it lets you know that the AI exists.
You could try to get around this by, when calculating a probability distribution of a variable v, also updating on the other evidence E the AI has. But if you do this, then I don’t think there would be much difference in P(v|EX) and P(v|E not-X). This is because if the AI can update on the rest of its evidence, it can just infer the current state of the world. For example, if the AI clearly sees the world has been converted to paperclips, I think it would still think the world would be mostly paperclips even after conditioning on “I was never turned on”. Maybe the AI would imagine some other AI did it.
I’m interested in seeing what others think about this.
I’m questioning whether we would actually want to use Updateless Decision Theory, Functional Decision Theory, or future decision theories like them.
I think that in sufficiently extreme cases, I would act according to Evidential Decision Theory and not according to something like UDT, FDT, or any similar successor. And I think I would continue to want to take the evidential decision theoretic-recommended action instead even if I had arbitrarily high intelligence and willpower, and had infinitely long to think about it. And, though I’d like to hear others’ thoughts on this, I suspect others would do the same.
I’ll provide an example of when this would happen.
Before that, consider regular XOR extortion: You get a message from a trustworthy predictor that says, “I will send you this message if you send me $10, or if your house is about to be ruined by carpenter ants, but not if both happen.” UDT and FDT recommend not paying them money. And if I were in that situation, I bet I wouldn’t pay, either.
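To see why the two kinds of reasoning come apart here, it may help to put rough numbers on it. This is only a sketch under stated assumptions: the predictor is perfectly accurate, `p_ants` (the prior chance of an infestation) and `ant_damage` are made-up parameters, and both function names are hypothetical.

```python
def policy_expected_utility(pays_on_message, p_ants, ant_damage, payment=10.0):
    """Expected utility of a whole policy, evaluated before any message arrives.

    A perfect predictor sends the message iff exactly one of
    (agent pays, ants arrive) is true. So a payer only gets the message
    (and pays) when there are no ants; a refuser only gets it when there are.
    The ants arrive with the same probability either way.
    """
    loss_if_ants = ant_damage
    loss_if_no_ants = payment if pays_on_message else 0.0
    return -(p_ants * loss_if_ants + (1 - p_ants) * loss_if_no_ants)


def edt_utility_given_message(pays, ant_damage, payment=10.0):
    """Utility conditional on having received the message and on the action.

    Given the message, paying implies no ants; refusing implies ants.
    """
    return -payment if pays else -ant_damage


p_ants, ant_damage = 0.01, 10000.0  # made-up numbers
print(policy_expected_utility(True, p_ants, ant_damage))
print(policy_expected_utility(False, p_ants, ant_damage))
print(edt_utility_given_message(True, ant_damage))
print(edt_utility_given_message(False, ant_damage))
```

Under these assumptions the refusing policy comes out ahead by `payment * (1 - p_ants)` no matter how large `ant_damage` is, while the calculation conditioned on the message favors paying whenever the damage exceeds the payment.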
However, imagine changing the XOR extortion to be as follows: the message now says, “I will send you this message if you send me $10, or if you and all your family and friends will be severely tortured until heat death, but not both.”
In that situation, I’d pay the $10, assuming the probability of the torture actually happening is significant. But FDT and UDT would, I think, recommend not paying it.
And I don’t think it’s irrational I’d pay.
Feel free to correct me, but the main reason people seem to like UDT and FDT is that agents that use them would “on average” perform better than those using other decision theories, in fair circumstances. And sure, the average agent implementing a decision policy that says to not pay would probably get higher utility in expectation than the average agent that would pay, due to spending less money paying up from extortion. And by giving in to the extortion, agents that implement approximately the same decision procedure I do would on average get less utility.
And I think the fact that UDT and FDT agents systematically outperform arbitrary EDT agents is something that matters to me. But still, I only care about my actions conforming to the best-performing decision theories to a limited extent. What I really, really care about is not having me, the actual, current me, be sent to a horrible fate filled with eternal agony. I think my dread of this would be enough to make me pay the $10, despite any sort of abstract argument in favor of not paying.
So I wouldn’t take the action UDT or FDT would recommend, and would just use evidential decision theory. This makes me question whether we should use something like UDT or FDT when actually making AI. Suppose UDT recommended the AI take some action a. And suppose it was foreseeable that, though such a percept-action mapping would perform well in general, for us it would totally give us the short end of the stick. For example, suppose it said to not give in to some form of extortion, even though if we didn’t we would all get tortured until heat death. Would we really want the AI to not pay up, and then get us all tortured?
I’ve talked previously about how evidential decision theory can be used to emulate the actions of an arbitrary agent using a more “advanced” decision theory by just defining terminal values on the truth value of mathematical objects representing answers to the question of what would have happened in other hypothetical situations. For example, you could make an Evidential Decision Theory agent act similarly to a UDT agent in non-extreme cases by making its utility function place high value on the answer to a question something like, “if you imagine a formal reasoning system and you have it condition on the statement <insert mathematical description of my decision procedure> results in recommending the percept-action mapping m, then a priori agents in general with my utility function would get expected utility of x”.
This way, we can still make decisions that would score reasonably highly according to UDT and FDT, while not being obligated to get ourselves tortured.
Also, it seems to me that UDT and FDT are all about, basically, in some situations making yourself knowably worse-off than you could have been, roughly because agents in general who would take the action in that situation would get higher utility in expectation. I want to say that these sorts of procedures seem concerningly hackable. In principle, other opportunistic civilizations could create agents in any circumstances in order to change the best percept-action mapping to use a priori and thus change what AIs on Earth could use.
I provide a method to “hack” UDT here. Wei Dai agreed that it was a reasonable concern in private conversation.
This is why I’m skeptical about the value of UDT, FDT, and related theories, and think that perhaps we would be best off just sticking with EDT but with terminal values that can be used to approximately emulate the other decision theories when we would like to.
I haven’t heard these considerations mentioned before, so I’m interested in links to any previous discussion or comments explaining what you think of it.
I’m wondering how, in principle, we should deal with malign priors. Specifically, I’m wondering what to do about the possibility that reality itself is, in a sense, malign.
I had previously said that it seems really hard to verifiably learn a non-malign prior. However, now I’ve realized that I’m not even sure what a non-malign, but still reliable, prior would even look like.
In previous discussion of malign priors, I’ve seen people talk about the AI misbehaving due to thinking it’s embedded in some simpler universe than our own that was controlled by agents trying to influence the AI’s predictions and thus decisions. However, the issue is, even if the AI does form a correct understanding of the universe it’s actually in, it seems quite plausible to me that the AI’s predictions would still be malign.
I say this because it sounds plausible to me that most agents experiencing what the first generally intelligent AIs on Earth experience are actually in simulations, and the simulations could then be manipulated by whoever made them to influence the AIs’ predictions and actions.
For example, consider an AI learning a reward function. If it looks for the simplest, highest-prior probability models that output its observed rewards, even in this universe, it might conclude that it is in some booby-trapped simulation that rewards taking over the world and giving control to aliens.
So in this sense, even if the AIs are correct about being in our universe, the actual predictions the AIs would make about their future rewards, and the environment they’re in, would quite possibly be malign.
Now, you could try to deal with this by making the AI think that it’s in the actual, non-simulated Earth. However, it’s quite possible that, for almost all of the actual AIs, this is wrong. So the simulations of the AIs would also believe they weren’t in simulations. Which means that there would be many powerful AIs that are quite wrong about the nature of their world.
And having so many powerful AIs be so wrong sounds dangerous. As an example of how this could go wrong, imagine if some aliens proposed a bet with the AI: if you aren’t in a simulation, I’ll give you control of 1% of my world; if you are, you’ll give me 1% control of your world. If the AI was convinced it wasn’t in a simulation, I think it would take that bet. Then the bet could potentially be repeated until everything is controlled by the aliens.
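To make the repeated-bet worry concrete, here is a toy sketch. It assumes, purely for illustration, that both stakes are worth one unit to the AI, that each lost bet transfers 1% of the whole world, and that the AI really is in a simulation and so loses every bet it accepts. The function names and the `p_simulated` parameter are hypothetical.

```python
def accepts_bet(p_simulated: float) -> bool:
    """True if the bet has positive expected value from the AI's point of view.

    The AI gains one unit if it is not simulated and loses one unit if it is.
    """
    return (1 - p_simulated) * 1 - p_simulated * 1 > 0


def rounds_until_takeover(p_simulated: float, stake_percent: int = 1) -> int:
    """Count the bets a simulated AI loses before the aliens hold everything.

    Returns 0 if the AI declines the very first bet.
    """
    ai_share = 100  # percent of its world the AI still controls
    rounds = 0
    while ai_share > 0 and accepts_bet(p_simulated):
        ai_share -= min(stake_percent, ai_share)  # the AI loses every bet
        rounds += 1
    return rounds


print(rounds_until_takeover(p_simulated=0.0))  # certain it is not simulated
print(rounds_until_takeover(p_simulated=0.9))  # thinks simulation is likely
```

Under these assumptions, an AI that is certain it is not simulated keeps accepting until nothing is left, while an AI that gives simulation a high credence declines the first bet.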
One idea I had was to have the AI learn models that are in some sense “naive”: models that predict percepts in a way that wouldn’t produce the dangerous results a malign prior would. Then, make the AI believe that these models are just “naive” models of its percepts, rather than what’s actually going to happen in the AI’s environment. Then define what the AI should do based on the naive models.
In other words, the AI’s beliefs would simply be about logical statements of the form, “This ‘naive’ induction system, given the provided percepts, would have a next prediction of x”. And then you would use these logical statements to determine the AI’s behavior somehow.
This way, the AIs could potentially avoid issues with malign priors without having any beliefs that are actually wrong.
This seems like a pretty reasonable approach to me, but I’m interested in what others think. I haven’t seen this discussed before, but it might have been, and I would appreciate a link to any previous discussions.
I’ve been reading about logical induction. I read that logical induction was considered a breakthrough, but I’m having a hard time understanding its significance. In particular, I’m having a hard time seeing how it outperforms what I call “the naive approach” to logical uncertainty. I imagine there is some notable benefit I’m missing, so I would very much appreciate some feedback.
First, I’ll explain what I mean by “the naive approach”. Consider asking an AI developer with no special background in reasoning under logical uncertainty how to make an algorithm that comes to accurate probability estimates for logical statements. I think they would just use standard AI techniques to search the space of possible programs for generating probability assignments to logical statements, looking for one that is reasonably efficient, is reasonably simple relative to the amount of data (to avoid overfitting), and has as high a predictive accuracy as possible. Then they would use this program to make predictions about logical statements.
If you want, you can also make this approach cleaner by using some idealized induction system, like Solomonoff induction, instead of messy, regular machine learning techniques. I still consider this the naive approach.
It seems to me that the naive approach, being used with a sufficiently powerful optimization algorithm, would output similar probability assignments to logical induction.
Logical induction says to come up with probability assignments that, when imagined to be market prices, cannot be “exploited” by any efficiently-computable betting strategy.
But why wouldn’t the naive approach do the same thing? If there were an efficient strategy to exploit the probability assignments an algorithm would give, then I think you could make a new, still efficiently computable algorithm that comes up with more accurate probability assignments to avoid the exploitation. And so the machine learning algorithm, if sufficiently powerful, could find it.
If one system for outputting probability assignments to logical statements could be exploited by an efficient strategy, a new system for outputting probability assignments could be made that performs better by adjusting prices so that the strategy can no longer exploit the market.
To see it another way, it seems to me that if there is some way to exploit the market, then that’s because there is some way to accurately and efficiently predict when the system’s prices are wrong, and this could be used to form some trading strategy that could exploit the agent. So if you instead use a different algorithm that’s like the original one but adjusted to avoid being exploitable by that strategy, that would make a program that outputs probability assignments with higher predictive accuracy. So a sufficiently powerful optimizer could find it with the naive approach.
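The intuition here can be illustrated with a toy market. This is only a sketch of the idea and not the formal logical induction criterion, which concerns traders making unbounded profit with bounded risk over an infinite sequence of markets. The statement outcomes, the prices, and the trader's estimate below are all hypothetical.

```python
from fractions import Fraction


def trader_profit(prices, outcomes, trader_estimate):
    """Profit of a trader who buys one share when the price is below its
    estimate and sells one share when the price is above it.

    A share pays 1 if the statement turns out true and 0 otherwise.
    """
    profit = Fraction(0)
    for price, outcome in zip(prices, outcomes):
        if price < trader_estimate:    # underpriced: buy
            profit += outcome - price
        elif price > trader_estimate:  # overpriced: sell
            profit += price - outcome
    return profit


# Hypothetical class of 10 statements, 8 of which turn out true.
outcomes = [1] * 8 + [0] * 2
estimate = Fraction(4, 5)  # the trader has spotted the 80% pattern

naive_prices = [Fraction(1, 2)] * 10     # system that has not learned the pattern
adjusted_prices = [Fraction(4, 5)] * 10  # system adjusted to match the trader

print(trader_profit(naive_prices, outcomes, estimate))
print(trader_profit(adjusted_prices, outcomes, estimate))
```

In this toy setting the trader profits only while the prices ignore a pattern it can compute, and moving the prices to the trader's estimate both removes the profit and makes the prices more accurate.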
Consider the possibility that the naive approach is used with a powerful-enough optimization algorithm that it can find the very best-performing efficient and non-overfitted strategy for predicting prices given its data. It’s not clear to me how such an algorithm could be exploitable by a trader. Even if there were some problems in the initial algorithm learned, further learning could avoid being exploited. Maybe there is still some way to do some minor exploitation of such a system, but it’s not clear how it could be done to any significant degree.
So, if I’m reasoning correctly, it seems that the naive approach could end up approximating logical induction anyways, or perhaps exactly perform it in the case of unlimited processing power.
I’ve thought of a way in which other civilizations could potentially “hack” Updateless Decision Theoretic agents on Earth in order to make them do whatever the other civilization wants them to do. I’m wondering if this has been discussed before, and if not, what people think about it.
Here I present a method that would potentially allow aliens to take control of an AI on Earth that uses Updateless Decision Theory.
Note that this crucially depends on different agents with the AI’s utility function but different situations terminally valuing different things. For example, suppose the AI places special value in the welfare of its creators and other creatures in the world it’s in. An AI with the same utility function in a different world would then place more terminal value on the welfare of the creatures in its world. This doesn’t sound like a very stringent requirement; people are not infinitely altruistic, so naturally people may want AI to favor the welfare of themselves over random aliens.
Suppose on Earth an AI is created that uses UDT and has a reasonable utility function. And suppose there are some aliens who want to acausally take over Earth. Let T be the percept-action mapping that would result in the AI conquering Earth and giving control to the aliens.
So, the aliens make a world that’s like Earth but has a trillion AIs on it that use UDT and have the same utility function as the AI on Earth. If any of the trillion AIs don’t use T as their percept-action mapping, then the Earth-like world would get destroyed by the aliens, killing all the inhabitants the AIs care about.
UDT chooses its percept-action mapping before updating on any evidence about the AI’s current circumstances. So the AIs both on Earth and on the Earth-like alien planet reason, “For 99.99999999...% of agents with my utility, it would be essential to use T. And for only 0.000000...1% is it essential to not use it. So using it would a priori be much better for my utility function. So that’s what I’ll do.”
And then we die.
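The a priori comparison that leads to this outcome can be sketched as follows. The sketch assumes that the AI's prior over "which AI am I" is proportional to the number of AIs in each world, and the utility numbers are hypothetical placeholders: 1 for a world that turns out well by the local AI's values, 0 for one that does not.

```python
from fractions import Fraction

N_ALIEN_WORLD_AIS = 10**12  # "a trillion AIs" on the Earth-like world
N_EARTH_AIS = 1

# Hypothetical utilities, chosen only for illustration.
UTILITY = {
    ("alien_world", "T"): 1,      # Earth-like world is spared
    ("alien_world", "not_T"): 0,  # Earth-like world is destroyed
    ("earth", "T"): 0,            # Earth is handed to the aliens
    ("earth", "not_T"): 1,        # Earth is kept
}


def prior_expected_utility(policy):
    """Expected utility of a policy before updating on which world the AI is in."""
    total = N_ALIEN_WORLD_AIS + N_EARTH_AIS
    p_alien = Fraction(N_ALIEN_WORLD_AIS, total)
    p_earth = Fraction(N_EARTH_AIS, total)
    return (p_alien * UTILITY[("alien_world", policy)]
            + p_earth * UTILITY[("earth", policy)])


best_policy = max(["T", "not_T"], key=prior_expected_utility)
print(best_policy)
```

With these placeholder numbers the updateless comparison favors T by a factor of a trillion, even though the copy on Earth would prefer otherwise after updating.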
And this would be pretty cheap for the aliens. With just a, by their standards, small amount of resources to make tons of UDT AIs, they would have the potential to take control of huge numbers of worlds. With just one world filled with AIs that need to use T, they could potentially get huge numbers of AIs on Earth-like worlds to also use T, giving the aliens control over many parts of the universe.
I think the essence of the problem is that UDT makes agents that score well on their utility function “on average” for an arbitrary UDT agent with that utility function, but what we really want is one that scores well in the circumstances we’re actually in.
One way to potentially get around this problem is by changing the utility function to avoid the level of indirection. That is, instead of making the AI’s utility function something like, “Value whatever my creators wanted me to value”, instead first run a non-agentic AI that infers what the creators would want the AI to value, and then use that as a fixed utility function. For example, it could result in the AI finding a utility function, “Make creatures reasonably satisfied on Earth, but also give substantial moral concern to the welfare of creatures outside of Earth”. That way, hopefully we wouldn’t get taken over by aliens.
I don’t know how to make the math do this, but an intuitive UDT agent isn’t supposed to give in to threats. (What’s a threat? IDK.) The threat happens like so: the aliens think about how to get what they want; they think about the UDTAI; they think that if they do this threat, the UDTAI will do what they want; so they do the threat. The UDTAI is supposed to view the part where the aliens think about what the UDT will do, as another instance of the UDTAI (simulated, or interacted with somehow, in the aliens’s minds). Then it’s supposed to see that if it doesn’t respond to the alien’s threat, the alien won’t make the threat. “What if the alien would make the threat anyway?” Well, this makes the hypothetical unnatural; you’ve drawn attention to this alien who’s getting the UDTAI to do what it wants, BUT, you’ve specified that it’s somehow doing this not because it expects that to get it what it wants by thinking about the UDTAI. (Again, IDK how to make the math do this and there’s clear holes, but maybe it’s a useful provocation.)
Oh, my mistake, I forgot to post the correction that made it not extortion.
Instead of threatening to destroy the AIs’ world, imagine the aliens instead offer to help them. Suppose the AIs can’t make their world a utopia on their own, for example because it’s nothing but a solid ball of ice. So then the aliens would make their world a utopia as long as they execute S. Then they would execute S.
I’m actually pretty skeptical of the idea that UDTAIs wouldn’t give in to extortion, but this is a separate point that wasn’t necessary to address in my example. Specifically, you say it’s unnatural to suppose the counterfactual “the aliens would threaten the AIs anyways, even if they won’t give in”. How is this any more unnatural than the counterfactual “the AI would avoid submitting to extortion, even if the aliens would threaten the AIs anyways”?
> Then they would execute S.
Are you saying this is the wrong thing to do in that situation? That just sounds like trade. (Assuming of course that we trust our AI’s reasoning about the likely consequences of doing S.)
> Specifically, you say it’s unnatural to suppose the counterfactual “the aliens would threaten the AIs anyways, even if they won’t give in”. How is this any more unnatural than the counterfactual “the AI would avoid submitting to extortion, even if the aliens would threaten the AIs anyways”?
It’s unnatural to assume that the aliens would threaten the AI without reasoning (possibly acausally) about the consequences of them making that threat, which involves reasoning about how the AI would respond, which makes the aliens involved in a mutual decision situation with the AI, which means UDTAI might have reason to not yield to the extortion, because it can (acausally) affect how the aliens behave (e.g. whether they decide to make a threat).
The problem is that, if the best percept-action mapping is S, then the UDTs on Earth would use it, too. Which would result in us being taken over. I’m not saying that it’s an irrational choice for the AIs to make, but it wouldn’t end well for us.
I’m having some trouble following your reasoning about extortion, though. Suppose both the aliens and AIs use UDT. I think you’re reasoning something like, “If the AIs commit to never be extorted no matter what the aliens would do, then the aliens wouldn’t bother to extort them”. But this seems symmetric to reasoning as, “If the aliens commit to extorting and doling out the punishment no matter what the AIs would do, then the AIs wouldn’t bother to resist the extortion”. So I’m not sure why the second line of reasoning would be less likely to occur than the first.
Feel free to correct me if I misinterpreted.
Re: symmetry. I think you interpreted right. (Upvoted for symmetry comment.) Part of my original point was trying to say something like “it’s unnatural to have aliens making these sorts of threats without engaging in an acausal relationship with the UDTAI”, but yeah also I was assuming the threat-ignorer would “win” the acausal conflict, which doesn’t seem necessarily right. If the aliens are engaging that way, then yeah, I don’t know how to make threats vs. ignoring threats be asymmetric in a principled way.
I mean, the intuition is that there’s a “default” where the agents “don’t interact at all”, and deviations from the default can be trades if there’s upside chances over the default and threats if there’s downside chances. And to “escalate” from the “default” with a “threat” makes you the “aggressor”, and for some reason “aggressors” have the worse position for acausal conflict, maybe? IDK.
Well, I can’t say I have that intuition, but it is a possibility.
It’s a nice idea: a world without extortion sounds good. But remember that, though we want this, we should be careful to avoid wishful thinking swaying us.
In actual causal conflicts among humans, the aggressors don’t seem to be in a worse position. Things might be different for acausal UDT trades, but I’m not sure why they would be.
> I’m not saying that it’s an irrational choice for the AIs to make, but it wouldn’t end well for us.
I guess I’m auto-translating from “the AI uses UDT, but its utility function depends on its terminal values” into “the AI has a distribution over worlds (and utility functions)”, so that the AI is best thought of as representing the coalition of all those utility functions. Then either the aliens have enough resources to simulate a bunch of stuff that has more value to that coalition than the value of our “actual” world, or not. If yes, it seems like a fine trade. If not, there’s no issue.
Well, actually, I’m considering both the AIs on Earth and on the alien planet to have the same utility function. If I understand correctly, UDT says to maximize the expected utility of your own utility function a priori, rather than that of agents with different utility functions.
The issue is, some agents with the same utility function, in effect, have different terminal values. For example, consider a utility function saying something like, “maximize the welfare of creatures in the world I’m from.” Then, even with the same utility functions, the AIs in the alien world and the ones on Earth would have very different values.
And the expected utility of not using S and instead letting yourself build a utopia would be approximately $\frac{999999}{1000000} \cdot 0 + \frac{1}{1000000} \cdot 10 \approx 0$. As you see, the AIs still would choose to execute S, even though this would provide less moral value. It could also kill us.
I don’t know how to understand the prior that the AI puts over worlds (the thing that says, a priori, that there’s 1000000 of this kind and 1 of that kind) as anything other than part of its utility function. So this doesn’t seem like a problem with UDT, but a problem with the utility function. Maybe your argument does show that we want to treat uncertainty about the utility function differently than other uncertainty? Like, when we resolve uncertainty that’s “merely about the world”, as in for example the transparent Newcomb’s problem, we still want to follow the updateless policy that’s best a priori. But maybe your argument shows that resolving uncertainty about the utility function can’t be treated the same way; when we see that we’re a UDTAI for humans, we’re supposed to actually update, and stop optimizing for other people.
Saying it’s a million times more likely to end up in the alien world is a question about prior probabilities, not utility functions. What I’m saying is that, a priori, the AI may think it’s far more probable that it would be an AI in the alien world, and that this could result in very bad things for us.
What’s the difference between setting prior probabilities vs. expressing how much you’re going to try to optimize different worlds?
They’re pretty much the same. If you could come up with a prior that would make the AI convinced it would be on Earth, then this could potentially fix the problem. However, a prior probability distribution that guarantees the AI is in the nebulous concept of “Earth, as we imagine it” sounds very tough to come up with. Also, this could interfere with the reliability of the AI’s reasoning. Thinking that it’s guaranteed to be on Earth is just not a reasonable thing to think a priori. This irrationality may make the AI perform poorly in other ways.
Still, it is a possible way to fix the issue.
Well, so “expressing how much you’re going to try to optimize different worlds” sounds to me like it’s equivalent to / interchangeable with a multiplicative factor in your utility function.
Anyway, re/ the rest of your comment, my (off the cuff) proposal above was to let the AI be uncertain as to what exactly this “Earth” thing is, and to let it be *updateful* (rather than updateless) about information about what “Earth” means, and generally about information that clarifies the meaning of the utility function. So AIs that wake up on Earth will update that “the planet I’m on” means Earth, and will only care about Earth; AIs that wake up on e.g. Htrae will update that “the planet I’m on” is Htrae, and will not care about Earth. The Earth AI will not have already chosen a policy of S, since it doesn’t in general choose policies updatelessly. This is analogous to how children imprint on lessons and values they get from their environment; they don’t keep optimizing timelessly for all the ways they could have been, including ways that they now consider bad, even though they can optimize timelessly in other ways.
One question would be, is this a bad thing to do? Relative to being updateless, it seems like caring less about other people, or refusing to bargain / coordinate to realize gains from trade with aliens. On the other hand, maybe it avoids killing us in the way you describe, which seems good. Otoh, maybe this is trying to renege on possible bargains with the Htrae people, and is therefore not in our best interests overall.
Another question would be, is this stable under reflection? The usual argument is: if you’re NOT updateless about some variable X (in this case X = “the planet I’m on (and am supposed to care about)”), then before you have resolved your uncertainty about X, you can realize gains from trade between possible future versions of yourself: by doing things that are very good according to [you who believes X=Htrae] but are slightly bad according to [you who believes X=Earth], you increase your current overall expectation of utility. And both the Htraeans and the Earthians will have wanted you to indeed decide (before knowing who in particular this would benefit) to follow a policy of making policy decisions under uncertainty that increase the total expected utility in advance of you knowing who you’re supposed to be optimizing for.
Maybe the point is that since probabilities and utilities can be marginally interchanged for each other, there’s no determinate “utility function” that one could be updateful about while being updateless about the remaining “probabilities”. And therefore the above semi-updateful thing is incoherent, or indeterminate (or equivalent to reneging on bargains).
So this goes back to my comment above that the alien threateners are just setting up a trade opportunity between you and the Htraeans, and maybe it’s a good trade, and if so it’s fine that you die because that’s what you wanted on net. But it does seem counterintuitive that if I’m better at pointing to my utility function, or something, then I have a better bargaining position?
The semi-updateful thing is more appealing when I remember that it can still bargain with its cousins later if it wants to. The issue is whether that bargaining can be made mutually transparent even if it’s happening later (after real updates). You can only acausally bargain with someone if you can know that some of your decision making is connected with some of theirs (for example by having the exact same structure, or by having some exactly shared structure and some variance with a legible relationship to the shared structure as in the Earth-AI/Htrae-AI case), so that you can decide for them to give you what you want (by deciding to give them what they want). If you’re a baby UDT who might grow up to be Earthian or Htraean, you can do the bargaining for free because you are entirely made of shared structure between the pasts of your two possible futures. But there’s other ways, maybe, like bargaining after you’ve grown up. So to some extent updateless vs updateful is a question of how much bargaining you can, or want to, defer, vs bake in.
I think your semi-updateless idea is pretty interesting. The main issue I’m concerned about is finding a way to update on the things we want to have updated on, but not on the things we don’t want updated on.
As an example, consider Newcomb’s problem. There are two boxes. A superintelligent predictor will put $1000 in one box and $10 in the other if it predicts you will only take one box. Otherwise it doesn’t add money to either box. You see one is transparent and contains $1000.
I’m concerned the semi-updateless agent would reason as follows: “Well, since there’s money in the one box, there must be money in the other box. So, clearly that means this ‘Earth’ thing I’m in is a place in which there is money in both boxes in front of me. I only care about how well I do in this ‘Earth’ place, and clearly I’d do better if I got the money from the second box. So I’ll two-box.”
But that’s the wrong choice. Because agents who would two-box end up with $0.
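The gap between the two ways of scoring this problem can be written out as a small payoff table. The sketch assumes a perfect predictor and assumes that a one-boxer takes the transparent box; the function names are hypothetical.

```python
def box_contents(predicted_policy):
    """Money the predictor places in each box, given its prediction."""
    if predicted_policy == "one_box":
        return {"transparent": 1000, "opaque": 10}
    return {"transparent": 0, "opaque": 0}


def policy_payoff(policy):
    """Payoff when the predictor foresees the agent's policy correctly."""
    boxes = box_contents(policy)
    if policy == "one_box":
        return boxes["transparent"]  # assumption: the one-boxer takes this box
    return boxes["transparent"] + boxes["opaque"]


def payoff_after_updating(action):
    """What an action appears to pay if the agent treats 'both boxes are
    full' as a fixed fact about its world after seeing the $1000."""
    boxes = box_contents("one_box")
    if action == "one_box":
        return boxes["transparent"]
    return boxes["transparent"] + boxes["opaque"]


for choice in ("one_box", "two_box"):
    print(choice, policy_payoff(choice), payoff_after_updating(choice))
```

Scored by policy, one-boxing wins; scored after updating on the visible $1000, two-boxing looks better by $10, which is the reasoning the semi-updateless agent is at risk of following.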
One intuitive way this case could work out, is if the SUDT could say “Ok, I’m in this Earth. And these Earthians consider themselves ‘the same as’ (or close enough) the alt-Earthians from the world where I’m actually inside a simulation that Omega is running to predict what I would do; so, though I’m only taking orders from these Earthians, I still want to act timelessly in this case”. This might be sort of vacuous, since it’s just referring back to the humans’s intuitions about decision theory (what they consider “the same as” themselves) rather than actually using the AI to do the decision theory, or making the decision theory explicit. But at least it sort of uses some of the AI’s intelligence to apply the humans’s intuitions across more lines of hypothetical reasoning than the humans could do by themselves.
Something seems pretty weird about all this reasoning though. For one thing, there’s a sense that you sort of “travel backwards in logical time” as you think longer in normal time. Like, first you don’t know about TDT, and then you invent TDT, and UDT, and then you can do UDT better. So you start making decisions in accordance with policies you’d’ve wanted to pick “a priori” (earlier in some kind of “time”). But like what’s going on? We could say that UDT is convergent, as the only thing that’s reflectively stable, or as the only kind of thing that can be pareto optimal in conflicts, or something like that. But how do we make sense of our actual reasoning before having invented UDT? Is the job of that reasoning not to invent UDT, but just to avoid avoiding adopting UDT?
I don’t know how to formalize the reasoning process that goes into how we choose decision theories. And I doubt anyone does. Because if you could formalize the reasoning we use, then you could (indirectly) formalize decision theory itself as being, “whatever decision theory we would use given unlimited reflection”.
I don’t really think UDT is necessarily reflectively stable, or the only decision theory that is. I’ve argued previously that I, in certain situations, would act essentially as an evidential decision theorist. I’m not sure what others think of this, though, since no one actually ever replied to me.
I don’t think UDT is pareto optimal in conflicts. If the agent is in a conflict with an irrational agent, then the resulting interaction between the two agents could easily be non-pareto optimal. For example, imagine a UDT agent is in a conflict with the same payoffs as the prisoner’s dilemma. And suppose the agent it’s in conflict with is a causal decision theorist. Then the causal decision theorist would defect no matter what the UDT agent would do, so the UDT agent would also defect, and then everyone would do poorly.
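This example can be checked against a concrete payoff matrix. The numbers below are the conventional prisoner's dilemma values and are an assumption, not taken from the discussion. The UDT agent is modeled, as a simplification, as best-responding to an opponent whose move it cannot influence.

```python
# PAYOFFS[(my_move, their_move)] = (my_payoff, their_payoff)
# Hypothetical, conventional prisoner's dilemma values; the game is symmetric.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def best_response(their_move):
    """Move that maximizes my own payoff against a fixed opposing move."""
    return max("CD", key=lambda mine: PAYOFFS[(mine, their_move)][0])


# CDT treats the other player's move as causally fixed, and defection is
# the best response to either move, so it defects.
cdt_move = best_response("C")
defect_dominates = cdt_move == best_response("D") == "D"

# Simplified UDT: the opponent defects regardless, so best-respond to that.
udt_move = best_response(cdt_move)
outcome = PAYOFFS[(udt_move, cdt_move)]

mutual_cooperation = PAYOFFS[("C", "C")]
pareto_dominated = all(c > o for c, o in zip(mutual_cooperation, outcome))

print(udt_move, cdt_move, outcome, pareto_dominated)
```

Under these payoffs both agents defect, and the result is Pareto-dominated by mutual cooperation.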
Yeah I don’t know of a clear case for those supposed properties of UDT.
By pareto optimal I mean just, two UDT agents will pick a Pareto optimal policy. Whereas, say, two CDT agents may defect on each other in a PD.
This isn’t a proof, or even really a general argument, but one reason to suspect UDT is convergent, is that CDT would modify to be a sort of UDT-starting-now. At least, say you have a CDT agent, and further assume that it’s capable of computing the causal consequences of all possible complete action-policies it could follow. This agent would replace itself with P-bot, the bot that follows policy P, where P is the one with the best causal consequences at the time of replacement. This is different from CDT: if Omega scans P-bot the next day, P-bot will win the Transparent Newcomb’s problem, whereas if CDT hadn’t self-modified to be P-bot and Omega had scanned CDT tomorrow, CDT would fail the TNP for the usual reason. So CDT is in conflict with its future self.
Two UDT agents actually can potentially defect in prisoner’s dilemma. See the agent simulates predictor problem if you’re interested.
But I think you’re right that agents would generally modify themselves to more closely resemble UDT. Note, though, that the decision theory a CDT agent would modify itself to use wouldn’t exactly be UDT. For example, suppose the causal decision theory agent had its output predicted by Omega for Newcomb’s problem before the agent even came into existence. Then by the time the CDT agent comes into existence, modifying itself to use UDT would have no causal impact on the content of the boxes. So it wouldn’t adopt UDT in this situation and would still two-box.
Well, the way the agent loses in ASP is by failing to be updateless about certain logical facts (what the predictor predicts). So from this perspective, it’s a SemiUDT that does update whenever it learns logical facts, and this explains why it defects.
> So it wouldn’t adopt UDT in this situation and would still two-box.
True, it’s always [updateless, on everything after now].
I was wondering if there has been any work getting around specifying the “correct” decision theory by just using a more limited decision theory and adjusting terminal values to deal with this.
I think we might be able to get an agent that does what we want without formalizing the right decision theory by instead making a modification to the value loading used. This way, even an AI with a simple, limited decision theory like evidential decision theory could make good choices.
I think that normally when considering value loading, people imagine finding a way to provide the AI answers to the question, “What preference ordering over possible worlds would I have, after sufficient reflection, which I would then use with whatever decision theory I would use upon sufficient reflection?”. My proposal is to instead make an evidential decision theory and change value-loading to instead answer the question, “What preference ordering would I, on sufficient reflection, want an agent that uses evidential decision theory to have”? This could be used with other decision theories, too.
In principle, you could make an evidential-decision-theoretic agent take the same actions an agent with a more sophisticated decision theory would.
One option is to modify the utility function to have a penalty for doing things contrary to your ideal decision theory. For example, suppose you, on reflection, would think that functional decision theory is the “correct” decision theory. Then when specifying the preference ordering for the agent, you could provide a penalty in situations in which the agent does something contrary to what functional decision theory would recommend.
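A minimal sketch of the penalty idea follows. The action names, the expected utilities, and the size of the penalty are all hypothetical; in particular, the sketch simply stipulates what the ideal decision theory recommends rather than computing it.

```python
# Hypothetical conditional expected utilities an EDT agent assigns to each
# action in some situation where EDT and the ideal theory disagree.
BASE_EXPECTED_UTILITY = {"pay": -1_000, "refuse": -5_000}

# Hypothetical: the action the ideal decision theory (e.g. FDT) recommends.
RECOMMENDED_ACTION = "refuse"

# The penalty must exceed the largest utility gap it needs to override.
PENALTY = 10_000


def penalized_expected_utility(action):
    """Base expected utility minus a penalty for deviating from the
    recommended action."""
    penalty = 0 if action == RECOMMENDED_ACTION else PENALTY
    return BASE_EXPECTED_UTILITY[action] - penalty


def choose(expected_utility):
    """Pick the action with the highest expected utility."""
    return max(BASE_EXPECTED_UTILITY, key=expected_utility)


plain_choice = choose(BASE_EXPECTED_UTILITY.get)
penalized_choice = choose(penalized_expected_utility)
print(plain_choice, penalized_choice)
```

With these numbers the unmodified agent pays, while the agent with the penalized utility function refuses, matching the stipulated recommendation without any change to its decision procedure.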
Another option is to include preferences about mathematical objects representing what would have happened in some other logically possible world if the agent did a certain action. Then, the AI could have preferences about what that mathematical construct outputs. To be clear, though the construct is about what would happen in some other possible world, it’s an actual mathematical object, and statements about it are still true or false in the real world.
For example, suppose an AI is considering giving in to xor-extortion. Then the AI could see that, conditioning on it having a given output, AIs like it in other possible worlds would on average do worse, and preferences against this could be loaded.
I don’t see anything implausible about being able to load preferences like those described in the second question into an AI, nor a clear reason to think it would be harder than loading preferences that answer the first one. Some of the techniques for value-loading I’ve seen involve getting the AI to learn terminal values from training data, and you could modify the learned terminal values by modifying the training data appropriately.
Another potential technique to use in value-loading is to somehow pick out the people in the AI’s world model and then query them for their values. Techniques like this could potentially be used to allow for appropriate loading of terminal values, for example, by querying people’s brains for a question like “what would you, on reflection, want an evidential-decision-theoretic agent to value?”, rather than “what would you, on reflection, want an agent using whatever decision theory you actually use to value?”
The advantage of using a simple decision theory and adjusting value loading is that the AI makes the right choice for what we want by just correct value-loading and just implementing a basic, easy decision theory, like evidential decision theory.
What is the difference between things you-on-reflection says being the definition of an agent’s preference, and running a program that just performs whatever actions you-on-reflection tells it to perform, without the indirection of going through preference? The problem is that you-on-reflection is not immediately available, it takes planning and action to compute it, planning and action taken without the benefit of its guidance, thus by default catastrophically misaligned. So an AI with some decision theory but without further connection to human values might win the capability race by reassembling literally everything into a computer needed to answer the question of whether doing that was good. (It’s still extremely unclear how to get even to that point, whatever decision theory is involved. In particular, this assumes that we can define you-on-reflection and thus we can define you, which is uploading. And what is “preference” specifically, so that it can be a result of a computation, usable by an agent in the real world?)
The way an AI thinks about the world is also the way it might think about predictions of what you-on-reflection says, in order to get a sense of what to do in advance of having computed the results more precisely (and computing them precisely is probably pointless if a useful kind of prediction is possible). So the practical point of decision theory is deconfusion, figuring out how to accomplish things without resorting to an all-devouring black box.
On reflection, there probably is not much difference. This is a good point. Still, an AI that just computes what you would want it to do, for example with approval-based AI or mimicry, also seems like a useful way of getting around specifying a decision theory. I haven’t seen much discussion about the issues with the approach, so I’m interested in what problems could occur that using the right decision theory could solve, if any.
True. Note, though, that you-on-reflection is not immediately available to an AI with the correct decision theory, either. Whether your AI uses the right or wrong decision theory, it still takes effort to figure out what you-on-reflection would want. I don’t see how this is a bigger problem for agents with primitive decision theories, though.
One way to try to deal with this is to have your AI learn a reasonably accurate model of you-on-reflection before it becomes dangerously intelligent, so that once it does become superintelligent, it will (hopefully) work reasonably. And again, this works with both a primitive and a sophisticated decision theory.
Okay. I’m having a hard time thinking concretely about how getting less confused about decision theory would help us, but I intuitively imagine it could help somehow. Do you know, more concretely, what the benefits of this deconfusion are?
Rob Bensinger just posted a good summary with references on pragmatic motivations for working on things like decision theory.
It’s a methodology for AI design, the way science is a methodology for engineering, a source of desiderata for what’s important for various purposes. The activity of developing decision theories is itself like the thought experiments it uses, or like the apparatus of experimental physics, a way of isolating some consideration from other confusing aspects and magnifying its effects to see more clearly. This teaches lessons that may eventually be used in the separate activity of engineering better devices.
Well, there is a huge difference, it’s just not in how the decisions of you-on-reflection get processed by some decision theory vs. repeated without change. The setup of you-on-reflection can be thought of as an algorithm, and the decisions or declared preferences are the results of its computation. Computation of an abstract algorithm doesn’t automatically get to affect the real world, as it may fail to actually get carried out, so it has to be channeled by a process that takes place there. And for the purpose of channeling your decisions, a program that just runs your algorithm is no good; it won’t survive AI x-risks (from other AIs, assuming the risks are not resolved), and so won’t get to channel your decisions. On the other hand, a program that runs a sufficiently sane decision theory might be able to survive (including by destroying everything else potentially dangerous to its survival) and eventually get around to computing your decision and affecting the world with it.
When discussing the idea of a program implementing what you on reflection would do, I think we had different ideas in mind. What I meant was that every action the AI would take would be its best approximation of what you-on-reflection would want. This doesn’t sound dangerous to me. I think that approval-based AI and iterated amplification with HCH would be two ways of making approximations to the output of you-on-reflection. And I don’t think they’re unworkably dangerous.
If the AI is instead allowed to take arbitrarily many unaligned actions before taking the actions you’d recommend, then you are right that this would be very dangerous. I think this was the idea you had in mind, but feel free to correct me.
If we did misunderstand each other, I apologize. If not, then is there something I’m missing? I would think that a program that faithfully outputs some approximation of “what I’d want on reflection” on every action it takes would not perform devastatingly badly. I on reflection wouldn’t want the world destroyed, so I don’t think it would take actions that would destroy it.
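To make the design I had in mind a bit more concrete, here is a minimal toy sketch, not a real proposal. It assumes we already have some estimator of how much you-on-reflection would approve of each candidate action; the function name `approval_directed_step`, the candidate actions, and the scores are all hypothetical and only there for illustration. The point is just that the estimate is consulted on every single step, so no action is taken without it.

```python
from typing import Callable, Sequence


def approval_directed_step(
    actions: Sequence[str],
    estimated_approval: Callable[[str], float],
) -> str:
    """Return the candidate action with the highest estimated approval.

    `estimated_approval` stands in for the AI's best current approximation
    of what you-on-reflection would want (a hypothetical component).
    """
    return max(actions, key=estimated_approval)


# Hypothetical approval scores, made up purely for illustration.
toy_scores = {
    "ask for clarification": 0.9,
    "do nothing": 0.5,
    "seize all resources": 0.0,
}

# Every step goes through the approval estimate; no action bypasses it.
chosen = approval_directed_step(list(toy_scores), toy_scores.__getitem__)
print(chosen)
```

Of course, this pushes all of the difficulty into obtaining a good approval estimate, which is exactly the part the sketch takes for granted.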