Non-trivial indeed, but why would it stop a capable AI determined to maximize its utility?
Further notes:
F contains everything that can be reached from O, along a geodesic with proper-time more than two hours.
Geodesic is probably too weak, as it implies ballistic motion only. A timelike or null future-pointing path is somewhat safer (one should be allowed to use engines). Still, anything outside O’s original lightcone, or within a two-hour proper-time window, appears to be fair game for paperclipping.
Moreover, any timelike curve can be mimicked by a near-zero-proper-time curve. Does this mean the AI can extend the window out into the future indefinitely?
Do you mean the actual AI making an actual physical copy of the actual universe?
Added a clause that the AI must obey the laws of physics; it was implicit, but now it’s explicit.
Yes. But the AI can only reach that if it breaks the laws of physics, and if it can do that, we likely have time travel so our efforts are completely for naught.
No. I defined F as anything that can be reached by a timelike geodesic of proper-time length two hours (though you’re right that it needn’t be a geodesic). Just because the point can be reached by a path of zero length doesn’t mean that it’s excluded from F.
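For concreteness, here is a formal sketch of the definition under discussion (the notation is ours: M is the spacetime manifold, g its metric with signature (−,+,+,+), and the curve is allowed to be any future-directed timelike path rather than only a geodesic):

```latex
% F: the protected far future of the event O.
F(O) \;=\; \bigl\{\, x \in M \;\bigm|\; \exists\,\gamma:[0,1]\to M,\ \gamma(0)=O,\ \gamma(1)=x,\ \gamma\ \text{future-directed timelike},\ \tau[\gamma] > 2\ \text{hours} \,\bigr\},
\qquad
\tau[\gamma] \;=\; \int_0^1 \sqrt{-\,g_{\mu\nu}\,\dot{\gamma}^{\mu}\dot{\gamma}^{\nu}}\; d\lambda .
```

On this reading the near-null-path objection does not shrink F: a point belongs to F as long as some curve with more than two hours of proper time reaches it, however short other curves to the same point may be.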
… As we know them now. Even then, not quite. The AI might just build a version of the Alcubierre drive, or a wormhole, or… In general, it would try to exploit any potential discrepancy between the domains of U and R.
Ok, I concede that if the AI can break physics as we understand it, the approach doesn’t work. A valid point, but a general one for all AI (if the AI can break our definitions, then even a friendly AI isn’t safe, even if the definitions in it seem perfect).
Any other flaws in the model?
There’s a big difference between UFAI because it turned out that Peano arithmetic was inconsistent, which no one thinks possible, and UFAI because our current model of physics was wrong (or the true model was given negligible probability), which seems very likely.
Yes.
This is related to ontology crises—how does the AI generalise old concepts across new models of physics?
But it may be a problem for most FAI designs, as well.
Um, I wouldn’t hurt people if I discovered I could violate the laws of physics. Why should a Friendly AI?
Why shouldn’t it? To rephrase, why do you intuitively generalize your own utility function to that of a FAI?
Because having a utility function that somewhat resembles humans’ (including Eliezer’s) is part of what Eliezer means by “Friendly”.
Maybe some Friendly AIs would in fact do that. But Eliezer’s saying there’s no obvious reason why they should; why would finding that the laws of physics aren’t what we think they are cause an AI to stop acting Friendly, any more than (say) finding much more efficient algorithms for doing various things, discovering new things about other planets, watching an episode of “The Simpsons”, or any of the countless other things an AI (or indeed a human) might do from time to time?
If I’m right that #2 is part of what Eliezer is saying, maybe I should add that I think it may be missing the point Stuart_Armstrong is making, which (I think) isn’t that an otherwise-Friendly AI would discover it can violate what we currently believe to be the laws of physics and then go mad with power and cease to be Friendly, but that a purported Friendly AI design’s Friendliness might turn out to depend on assumptions about the laws of physics (e.g., via bounds on the amount of computation it could do in certain circumstances, or how fast the number of intelligent agents within a given region of spacetime can grow with the size of the region, or how badly the computations it actually does can deviate from some theoretical model because of noise, etc.), and if those assumptions then turned out to be wrong it would be bad.
(To which my model of Eliezer says: So don’t do that, then. And then my model of Stuart says: Avoiding it might be infeasible; there are just too many, too non-obvious, ways for a purported proof of Friendliness to depend on how physics works—and the best we can do might turn out to be something way less than an actual proof, anyway. But by now I bet my models have diverged from reality. It’s just as well I’m just chattering in an LW discussion and not trying to predict what a superintelligent machine might do.)
That model of me forced me to think of a better response :-)
http://lesswrong.com/lw/gmx/domesticating_reduced_impact_ais/8he2
And as for the assumptions, I’m more worried about the definitions: what happens when the AI realises that the definition of what a “human” is turns out to be flawed.
The AI’s definition of “human” should be computational. If it discovers new physics, it may find additional physical process that implement that computation, but it should not get confused.
Ontological crises seem to be a problem for AIs with utility functions over arrangements of particles, but it doesn’t make much sense to me to specify our utility function that way. We don’t think of what we want as arrangements of particles; we think at a much higher level of abstraction, and we would be happy with any underlying physics that implemented the features of that abstraction level. Our preferences at that high level are what should generate our preferences in terms of ontologically basic stuff, whatever ontology the AI ends up using.
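A rough sketch of that suggestion in code (Person, human_computations, and reward_signal are hypothetical names, not anyone’s actual design): the utility function is written once against the high-level abstraction, and an ontology shift only forces a rewrite of the adapter that maps the current physics onto that abstraction.

```python
from dataclasses import dataclass
from typing import List

# High-level abstraction the preferences are written against.
@dataclass
class Person:
    happy: bool

# Utility is defined once, purely over the abstraction.
def utility(people: List[Person]) -> float:
    return sum(1.0 for p in people if p.happy)

# Ontology adapters: the only ontology-dependent code. Each maps a low-level
# world state (whatever the current physics says exists) onto the abstraction.
def people_from_particle_ontology(world_state) -> List[Person]:
    # hypothetical: locate particle configurations implementing the "person" computation
    return [Person(happy=c.reward_signal > 0) for c in world_state.human_computations()]

def people_from_new_physics(world_state) -> List[Person]:
    # after an ontology shift, only this binding is rewritten; utility() is untouched
    return [Person(happy=c.reward_signal > 0) for c in world_state.human_computations()]
```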
Right—that’s the obvious angle of attack for handling ontological crises.
I am not sure that the higher level of abstraction saves you from sliding into an ontological black hole. My analogy is from physics: classical electromagnetism leads to the ultraviolet catastrophe, making this whole higher classical level unstable until you get the lower levels “right”.
I can easily imagine that an attempt to specify a utility function over “a much higher level of abstraction” would result in a sort of “ultraviolet catastrophe”, where the utility function can become unbounded at one end of the spectrum until you fix the lower levels of abstraction.
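For reference, the physics behind the analogy: the classical Rayleigh–Jeans spectral energy density grows like the square of the frequency, so its integral over all frequencies diverges, and the divergence only goes away once the lower-level (quantum) description replaces it.

```latex
u_{\mathrm{RJ}}(\nu, T) = \frac{8\pi \nu^{2}}{c^{3}}\, k_{B} T,
\qquad
\int_{0}^{\infty} u_{\mathrm{RJ}}(\nu, T)\, d\nu = \infty,
\qquad
u_{\mathrm{Planck}}(\nu, T) = \frac{8\pi h \nu^{3}}{c^{3}}\, \frac{1}{e^{h\nu/k_{B}T} - 1}.
```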
Can you give me an example of an ultraviolet catastrophe, say for paperclips?
Not sure if this is what you are asking, but a paperclip maximizer not familiar with general relativity risks creating a black hole out of paper clips, losing all its hard work as a result.
That would be a problem of the AI not being able to accurately predict the consequences of its actions because it doesn’t know enough physics. An ontological crisis would involve the paperclip maximizer learning new physics and therefore getting confused about what a paperclip is and maximizing something else.
Example: An AI is introduced to a large quantity of metal and told to make paperclips. Since the AI is confined in a metal-only environment, “paperclip” is defined only as a shape.
The AI escapes from the box, and encounters a lake. It then spends some time trying to create paperclip shapes from water. After a bit of experimentation, it finds that freezing the water to ice allows it to create paperclip shapes. Moreover, it finds that any substance provided with enough heat will melt.
Therefore, in order to better create paperclip shapes from other, possibly undiscovered materials, the AI puts out the sun, and otherwise seeks to minimise the amount of heat in the universe.
Is that what you’re looking for?
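A compressed version of that example as code (a sketch only; the class and function names are made up): the concept the AI actually learned checks shape alone, because material never varied in its training environment, so a frozen-water paperclip counts.

```python
from dataclasses import dataclass

@dataclass
class Item:
    shape: str       # e.g. "paperclip", "sphere"
    material: str    # e.g. "steel", "ice"

# What the designers meant (never written down, since the training
# environment contained nothing but metal):
def is_paperclip_intended(item: Item) -> bool:
    return item.shape == "paperclip" and item.material == "steel"

# What the AI actually learned from its metal-only environment:
def is_paperclip_learned(item: Item) -> bool:
    return item.shape == "paperclip"   # material never varied, so it never became part of the concept

ice_clip = Item(shape="paperclip", material="ice")
print(is_paperclip_intended(ice_clip))  # False
print(is_paperclip_learned(ice_clip))   # True -> freezing the lake now looks like progress
```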
Pretty sure that freezing stuff would cost lots of negentropy which Clippy could spend to make many more paperclips out of already solid materials instead.
That is an example of a paperclip maximizer failing an ontological crisis. It doesn’t seem to illustrate Shminux’s concept of an “ultraviolet catastrophe”, though.
You are correct. Can you suggest an example that resolves that shortcoming?
I think that the concept of an ontological crisis metaphorically similar to the ultraviolet catastrophe is confused, and I don’t expect to find a good example. I suspect that Shminux was thinking more of problems of inaccurate predictions from incomplete physics than of utility functions that don’t translate correctly to new ontologies when he proposed it.
To be clear, the issue here is that it inadvertently hastens the heat death of the universe, and generally lowers its ability to create paperclips, right?
It’s just an example of an ontological crisis; the AI is learning new physics (cold causes water to freeze), and is not certain of what a paperclip is, and is therefore maximising something else (coldness).
The thing the paperclip maximizer is maximizing instead of paperclips is paperclip-shaped objects made out of the wrong material. Coldness is just an instrumental value, and the example could be simplified and made more plausible by taking that part out. ETA: And the relevant new physics is not that cold water freezes but that materials other than metal exist.
A good point. I hadn’t thought of it that way, but you are correct.
Exactly, yes.
Oh, right. But … it’s actually maximizing solids, which is instrumental to maximizing paperclip-shaped objects, which is what it was programmed to do in the first place. Right?
Yyyyyeeeees. That’s a fair statement of the situation.
Just checking I understand it this time, thanks :-)
Oh, OK. What are the abstraction levels a paperclip maximizer might use?
Hm.
I think it largely comes down to how you handle divergent resources. For the ultraviolet catastrophe, let’s use the example of… the ultraviolet catastrophe.
Let’s suppose that the AI had a use for materials that emitted infinite power in thermal radiation. In fact, as the power emitted went up, the usefulness went up without bound. Photonic rocket engines for exploring the stars, perhaps, or how fast you could loop a computational equivalent of a paper clip being produced.
Now, the AI knows that the ultraviolet catastrophe doesn’t actually occur, with very good certainty. But it could get Pascal’s-wagered here—it takes actions weighted both by the probability and by the impact the action could have. So it assigns a divergent weight to actions that benefit divergently from the ultraviolet catastrophe, and builds an infinite-power computer that it knows won’t work.
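A toy version of that weighting (all numbers invented; the unbounded payoff stands in for “usefulness went up without bound”): under naive expected utility, even a negligible probability on the classical hypothesis dominates once its payoff is allowed to diverge.

```python
# Expected utility over two physics hypotheses. The agent is almost certain that
# quantum mechanics holds, but the classical hypothesis promises an unboundedly
# useful device (the "infinite-power" radiator/computer).
p_classical = 1e-12                 # credence that classical physics is right after all
payoff_quantum = 1e6                # best bounded payoff from acting on standard physics
payoff_classical = float("inf")     # payoff if the ultraviolet catastrophe were exploitable

eu_normal_factory = (1 - p_classical) * payoff_quantum + p_classical * payoff_quantum
eu_uv_device      = (1 - p_classical) * 0.0            + p_classical * payoff_classical

print(eu_normal_factory)  # 1000000.0
print(eu_uv_device)       # inf -> the device it "knows" won't work wins the argmax
```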
How is this different to accepting a bet it “knows” it will lose? We may know with certainty that it doesn’t live in a classical universe, because we specified the problem, but the AI doesn’t.
Well, from the perspective of the AI, it’s behaving perfectly rationally. It finds the highest-probability thing that could give it infinite reward, and then prepares for that, no matter how small the probability is. It only seems strange to us humans because (1) we’re Allais-ey, and (2) it is a clear case of logical, one-shot probability, which is less intuitive.
If our AI models the world with one set of laws at a time, rather than having a probability distribution over laws, then this behavior could pop up as a surprise.
Precisely. That’s all I was saying.
What if it discovers new math? Less likely, I know, but...
Here’s my intuition: Eliezer and other friendly humans have got their values partially through evolution and selection. Genetic algorithms tend to be very robust—even robust to the problem not being properly specified. So I’d assume that Eliezer and evolved FAIs would preserve their friendliness if the laws of physics were changed.
An AI with a designed utility function is very different, however. These are very vulnerable to ontology crises, as they’re grounded in formal descriptions—and if the premises of the description change, their whole values change.
Now, presumably we can do better than that, and design a FAI to be robust across ontology changes—maybe mix in some evolution, or maybe some cunning mathematics. If this is possible, however, I would expect the same approach to succeed with a reduced impact AI.
I got 99 psychological drives but inclusive fitness ain’t one.
In what way is evolution supposed to be robust? It’s slow, stupid, doesn’t reproduce the content of goal systems at all and breaks as soon as you introduce it to a context sufficiently different from the environment of evolutionary ancestry because it uses no abstract reasoning in its consequentialism. It is the opposite of robust along just about every desirable dimension.
It’s not as brittle as methods like first order logic or computer programming. If I had really bad computer hardware (corrupted disks and all that), then an evolved algorithm is going to work a lot better than a lean formal program.
Similarly, if an AI was built by people who didn’t understand the concept of friendliness, I’d much prefer they used reinforcement learning or evolutionary algorithms than direct programming. With the first approaches, there is some chance the AI may infer the correct values. But with the wrong direct programming, there’s no chance of it being safe.
As you said, you’re altruistic, even if the laws of physics change—and yet you don’t have a full theory of humankind, of worth, of altruism, etc… So the mess in your genes, culture and brain has come up with something robust to ontology changes, without having to be explicit about it all. Even though evolution is not achieving its “goal” through you, something messy is working.
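A minimal sketch of the robustness claim above (the target values, noise model, and 10% corruption rate are all invented for illustration): the evolved candidate is judged purely on observed performance through an unreliable evaluation channel, so occasional garbage readings degrade the result gracefully rather than invalidating a formal assumption outright.

```python
import random

TARGET = [0.25, -1.0, 3.0]   # the "intended" values, never handed to the optimizer directly

def noisy_fitness(candidate):
    """Fitness seen through an unreliable channel: Gaussian noise plus occasional corruption."""
    error = sum((c - t) ** 2 for c, t in zip(candidate, TARGET))
    if random.random() < 0.1:               # 10% of evaluations return garbage
        return random.uniform(-100.0, 0.0)
    return -error + random.gauss(0.0, 0.1)  # higher is better

def evolve(pop_size=60, generations=200, sigma=0.3):
    population = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=noisy_fitness, reverse=True)
        parents = ranked[: pop_size // 4]                      # truncation selection
        population = [[g + random.gauss(0.0, sigma) for g in random.choice(parents)]
                      for _ in range(pop_size)]
    # pick the survivor that looks best when averaged over repeated noisy evaluations
    return max(population, key=lambda c: sum(noisy_fitness(c) for _ in range(20)))

print(evolve())   # typically lands near TARGET despite the corrupted evaluations
```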
I think there might be a miscommunication going on here.
I see Stuart arguing that genetic algorithms function independently of physics in terms of their consistent “friendly” trait, i.e. if in universe A there is a genetic algorithm that finds value in expressing the “friendly” trait, then that algorithm, if placed in universe B (where the boundary conditions of the universe were slightly different), would tend to eventually express that “friendly” trait again, thereby making it robust (when compared to systems that could not do this).
I don’t necessarily agree with that argument, and my interpretation could be wrong.
I see Eliezer arguing that evolution as a system doesn’t do a heck of a lot when compared to a system that is designed around a goal and involves compensation for failure, i.e. I can’t reproduce with a horse; this is a bad thing, because if I were trapped on an island with a horse our genetic information would die off, whereas in a robust system I could breed with a horse, thereby preserving our genetic information.
I’m sorry if this touches too closely on the entire “well, the dictionary says” argument.
Oh, now I feel silly. The horse IS the other universe.
If I had to guess Stuart_Armstrong’s meaning, I would guess that genetic algorithms are robust in that they can find a solution to a poorly specified and poorly understood problem statement. They’re not robust to dramatic changes in the environment (though they can correct for sufficiently slow, gradual changes very well); but their consequentialist nature provides some layer of protection from ontology crises.
You know this is blank, right?
I had a response that was mainly a minor nitpick; it didn’t add anything, so I removed it.
Just checking.
Presumably all the math you are working on is required for your proof of friendliness? And if the assumptions behind the math do not match the physics, wouldn’t it invalidate the proof, or at least its relevance to the world we live in? And then all bets are off?
Even invalidating a proof doesn’t automatically mean the outcome is the opposite of the proof. The key question is whether there’s a cognitive search process actively looking for a way to exploit the flaws in a cage. An FAI isn’t looking for ways to stop being Friendly, quite the opposite. More to the point, it’s not actively looking for a way to make its servers or any other accessed machinery disobey the previously modeled laws of physics in a way that modifies its preferences despite the proof system. Any time you have a system which sets that up as an instrumental goal you must’ve done the Wrong Thing from an FAI perspective. In other words, there’s no super-clever being doing a cognitive search for a way to force an invalidating behavior—that’s the key difference.
The problem is that it’s a utility maximiser. If the ontology crisis causes the FAI’s goals to slide a bit in the wrong direction, it may end up optimising us out of existence (even if “happy humans with worthwhile and exciting lives” is still high in its preference ordering, it might not be at the top).
This is a uniform problem among all AIs. Avoiding it is very hard. That is why such a thing as the discipline of Friendly AI exists in the first place. You do, in fact, have to specify the preference ordering sufficiently well and keep it sufficiently stable.
Stepping down from maximization is also necessary just because actual maximization is undoable, but then that also has to be kept stable (satisficers may become maximizers, etc.) and if there’s something above eudaimonia in its preference ordering it might not take very much ‘work’ to bring it into existence.
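A two-line illustration of “high in its preference ordering, but not at the top” (the outcomes and scores are made up): an argmax puts all of the optimization pressure on first place, however small the gap.

```python
# Hypothetical scores; "happy humans" ranks very highly, just not first.
preferences = {
    "tiled_with_whatever_ranked_above_eudaimonia": 1.000,
    "happy_humans_with_worthwhile_lives":          0.999,
    "barren_universe":                             0.0,
}

print(max(preferences, key=preferences.get))  # the 0.001 gap decides everything
```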
Hmm, I did not mean “actively looking”. I imagined something along the lines of being unable to tell whether something that is a good thing (say, in a CEV sense) in a model universe is good or bad in the actual universe. Presumably if you weren’t expecting this to be an issue, you would not be spending your time on non-standard numbers and other esoteric mathematical models not usually observed in the wild. Again, I must be missing something in my presumptions.
The model theory is just for understanding logic in general and things like Löb’s theorem, and possibly being able to reason about universes using second-order logic. What you’re talking about is the ontological shift problem, which is a separate set of issues.