Think of something you currently value, the more highly valued the better. You don’t need to say what it is, but it does need to be something that seriously matters to you. Not just something you enjoy, but something that you believe is truly worthwhile.
I could try to give examples, but the thought exercise only works if it’s about what you value, not me.
Now imagine that you could press a button so that you no longer care about it at all, or even actively despise it. Would you press that button? Why, or why not?
I definitely wouldn’t press that button. And I understand that you’re demonstrating the general principle that you should try to preserve your utility function. And I agree with this.
But what I’m saying is that the AI, by exploiting hardware-level vulnerabilities, isn’t changing its utility function. The actual utility function, as implemented by the programmers, returns 999999999 for some possible world due to the hardware-level imperfections in modern computers.
In the spirit of your example, I’ll give another example that I think demonstrates the problem:
First, note that brains don’t always function as we’d like, just like computers. Imagine there is a very specific thought about a possible future that, when considered, makes you imagine that future as extremely desirable. It seems so desirable to you that, once you thought of it, you woiuld pursue it relentlessly. But this future isn’t one that would normally be considered desirable. It might just be about paperclips or something. However, that very specific way of thinking about it would “hack” your brain, making you view that future as desirable even though it would normally be seen as arbitrary.
Then, if you even happen upon that thought, you would try to send the world in that arbitrary direction.
Hopefully, you could prevent this from happening. If you reason in the abstract that you could have those sorts of thoughts, and that they would be bad, then you could take corrective action. But this requires that you do find out that thinking those sorts of thoughts would be bad before concretely finding those thoughts. Then you could apply some change to your mental routine or something to avoid thinking those thoughts.
And if I had to guess, I bet an AI would also be able to do the same thing and everything would work out fine. But maybe not. Imagine the AI consider an absolutely enormous number of possible worlds before taking its first action. And imagine and even found a way to “hack” its utility function in that very first time step. Then there’s no way the AI could make preventative action: It’s already thought up the high-utility world from hardware unreliability and now is trying to pursue that world.
I’m confused. In the original comments you’re talking about a super-intelligent AI noting a exploitable hardware flaw in itself and deliberately using that error to hack its utility function using something like rowhammer exploit.
Then you say that the utility function already had an error in it from the start and the AI isn’t using its intelligence to do anything except note that it has this flaw. Then introduce an analogy in which I have a brain flaw that under some bizarre circumstances will turn me into a paperclip maximizer, and I am aware that I have it.
In this analogy, I’m doing what? Deliberately taking drugs and using guided meditation to rowhammer my brain into becoming a paperclip maximizer?
I think had been unclear in my original presentation. I’m sorry for that. To clarify, the AI is never changing the code of its utility function. Instead, it’s merely finding an input that, through some hardware-level bug, causes it to produce outputs in conflict with the mathematical specification. I know “hack the utility function” makes it sound like the actual code in the utility function was modified; describing it that way was a mistake on my part.
I had tried to make the analogy to more intuitively explain my idea, but it didn’t seem to work. If you want to better understand my train of thought, I suggest reading the comments between Vladmir and I.
In the analogy, you aren’t doing anything to deliberately make yourself a paperclip maximizer. You just happen to think of a thought that turned you into a paperclip maximizer. But, on reflection, I think that this is a bizarre and rather stupid metaphor. And the situation is sufficiently different from the one with AI that I don’t even think it’s really informative of what I think could happen to an AI.
Ah okay, so we’re talking about a bug in the hardware implementation of an AI. Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.
Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.
Could you explain why you think it has very little probability mass compared to the others? A bug in a hardware implementation is not in the slightest far-fetched: I think that modern computers in general have exploitable hardware bugs. That’s why row-hammer attacks exist. The computer you’re reading this on could probably get hacked through hardware-bug exploitation.
The question is whether the AI can find the potential problem with its future utility function and fix it before coming across the error-causing possible world.
There’s a huge gulf between “far-fetched” and “quite likely”.
The two big ones are failure to work out how to create an aligned AI at all, and failure to train and/or code a correctly designed aligned AI. In my opinion the first accounts for at least 80% of the probability mass, and the second most of the remainder. We utterly suck at writing reliable software in every field, and this has been amply borne out in not just thousands of failures, but thousands of types of failures.
By comparison, we’re fairly good at creating at least moderately reliable hardware, and most of the accidental failure modes are fatal to the running software. Flaws like rowhammer are mostly attacks, where someone puts a great deal of intelligent effort into finding an extremely unusual operating mode in which some some assumptions can be bypassed with significant effort into creating exactly the wrong operating conditions.
There are some examples of accidental flaws that affect hardware and aren’t fatal to its running software, but they’re an insignificant fraction of the number of failures due to incorrect software.
I agree that people are good at making hardware that works reasonably reliably. And I think that if you were to make an arbitrary complex program, the probability that it would fail from hardware-related bugs would be far lower than the probability of it failing for some other reason.
But the point I’m trying to make is that an AI, it seems to me, would be vastly more likely to run into something that exploits a hardware-level bug than an arbitrary complex program. For details on why I imagine so, please see this comment.
I’m trying to anticipate where someone could be confused about the comment I linked to, so I want to clarify something. Let S be the statement, “The AI comes across a possible world that causes its utility function to return very high value due to hardware bug exploitation”. Then it’s true that, if the AI has yet to find the error-causing world, the AI would not want to find it. Because utility(S) is low. However, this does not mean that the AI’s planning or optimization algorithm exerts no optimization pressure towards finding S.
Imagine the AI’s optimization algorithm as a black boxes that take as input a utility function and search space and output solutions that score highly on its utility function. Given that we don’t know what future AI will look like, I don’t think we can have a model of the AI much more informative than the above. And the hardware-error-caused world could score very, very highly on the utility function, much more so than any non-hardware-error-caused world. So I don’t think it should be too surprising if a powerful optimization algorithm finds it.
Yes, utility(S) is low, but that doesn’t mean the optimization actually calls utility(S) or uses it to adjust how it searches.
I think there are at least three different things being called “the utility function” here, and that’s causing confusion:
The utility function as specified in the software, mapping possible worlds to values. Let’s call this S.
The utility function as it is implemented running on actual hardware. Let’s call this H.
A representation of the utility function that can be passed as data to a black box optimizer. Let’s call this R.
You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the Al’s hardware and other constraints, and return a possible world maximizing H.
From my point of view, that’s already a design fault. The designers of this AI want S maximized, not H. The AI itself wants S maximized instead of H in all circumstances where the hardware flaw doesn’t trigger. Who chose to pass H into the optimizer?
You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the Al’s hardware and other constraints, and return a possible world maximizing H.
From my point of view, that’s already a design fault.
I agree; this is a design flaw. The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn’t have this design flaw.
That is, I don’t know of any AI architecture that does not involve using a potentially hardware-bug-exploitable utility function as input into some planning or optimization problem. And I’m not sure there even is one.
In the rest of this comment I’ll just suggest approaches and show how they are still vulnerable to the hardware-bug-exploitation problem.
I have some degree of background in artificial intelligence, and the planning and optimization algorithms I’ve seen take the function to be maximized as an input parameter. Then, when people want to make an AI, they just call that planning or optimization algorithm with their (hardware-bug-exploitable) utility or cost functions. For example, suppose someone wants to make a plan that minimizes cost function f in search space s. Then I think they just directly do something like:
return a_star(f, s)
And this doesn’t provide any protection from hardware-level exploitation.
Now, correct me if I’m wrong, but it seems your thinking of the AI first doing some pre-processing to find an input to the planning or optimization algorithm that is resistant to hardware-bug-exploitation.
But how do you actually do that? You could regard the input the AI puts into the optimization function to be a choice it makes. But how does it make this choice? The only thing I can think of is having a planning or optimization algorithm figure out out what function to use as the input to the optimization or planning algorithm.
But if you need to use a planning or optimization algorithm to do this, then what utility function do you pass into this planning or optimization algorithm? You could try to pass the actual, current, hardware-bug-exploitable utility function. But then this doesn’t resolve the problem of hardware-bug-exploitation: when coming up with a utility function to input to the optimization, the AI may find such an input that itself scores very high due to hardware bug exploitation.
To describe the above more concretely, you could try doing something like this:
with its hardware-bug-exploitable utility function. Thus, the output, reasonable_utility_function_use, might be very wrong due to hardware bug exploitation having been used to come up with this.
Now, you might have had some other idea in mind. I don’t know of a concrete way to get around this problem, so I’m very interested to hear your thoughts.
My concern is that people will figure out how to make powerful optimization and planning algorithms without first figuring out how to fix this design flaw.
The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn’t have this design flaw.
Yes you have. None of the these optimization procedures analyze the hardware implementation of a function in order to maximize it.
The rest of your comment is irrelevant, because what you have been describing is vastly worse than merely calling the function. If you merely call the function, you won’t find these hardware exploits. You only find them when analyzing the implementation. But the optimizer isn’t given access to the implementation details, only to the results.
If you prefer, you can cast the problem in terms of differing search spaces. As designed, the function U maps representations of possible worlds to utility values. When optimizing, you make various assumptions about the structure of the function—usually assumed to be continuous, sometimes differentiable, but in particular you always assume that it’s a function of its input.
The fault means that under some conditions that are extremely unlikely in practice, the value returned is not a function of the input. It’s a function of input and a history of the hardware implementing it. There is no way for the optimizer to determine this, or anything about the conditions that might trigger it, because they are outside its search space. The only way to get an optimizer that searches for such hardware flaws is to design it to search for them.
In other words pass the hardware design, not just the results of evaluation, to a suitably powerful optimizer.
Think of something you currently value, the more highly valued the better. You don’t need to say what it is, but it does need to be something that seriously matters to you. Not just something you enjoy, but something that you believe is truly worthwhile.
I could try to give examples, but the thought exercise only works if it’s about what you value, not me.
Now imagine that you could press a button so that you no longer care about it at all, or even actively despise it. Would you press that button? Why, or why not?
I definitely wouldn’t press that button. And I understand that you’re demonstrating the general principle that you should try to preserve your utility function. And I agree with this.
But what I’m saying is that the AI, by exploiting hardware-level vulnerabilities, isn’t changing its utility function. The actual utility function, as implemented by the programmers, returns 999999999 for some possible world due to the hardware-level imperfections in modern computers.
In the spirit of your example, I’ll give another example that I think demonstrates the problem:
First, note that brains don’t always function as we’d like, just like computers. Imagine there is a very specific thought about a possible future that, when considered, makes you imagine that future as extremely desirable. It seems so desirable to you that, once you thought of it, you woiuld pursue it relentlessly. But this future isn’t one that would normally be considered desirable. It might just be about paperclips or something. However, that very specific way of thinking about it would “hack” your brain, making you view that future as desirable even though it would normally be seen as arbitrary.
Then, if you even happen upon that thought, you would try to send the world in that arbitrary direction.
Hopefully, you could prevent this from happening. If you reason in the abstract that you could have those sorts of thoughts, and that they would be bad, then you could take corrective action. But this requires that you do find out that thinking those sorts of thoughts would be bad before concretely finding those thoughts. Then you could apply some change to your mental routine or something to avoid thinking those thoughts.
And if I had to guess, I bet an AI would also be able to do the same thing and everything would work out fine. But maybe not. Imagine the AI consider an absolutely enormous number of possible worlds before taking its first action. And imagine and even found a way to “hack” its utility function in that very first time step. Then there’s no way the AI could make preventative action: It’s already thought up the high-utility world from hardware unreliability and now is trying to pursue that world.
I’m confused. In the original comments you’re talking about a super-intelligent AI noting a exploitable hardware flaw in itself and deliberately using that error to hack its utility function using something like rowhammer exploit.
Then you say that the utility function already had an error in it from the start and the AI isn’t using its intelligence to do anything except note that it has this flaw. Then introduce an analogy in which I have a brain flaw that under some bizarre circumstances will turn me into a paperclip maximizer, and I am aware that I have it.
In this analogy, I’m doing what? Deliberately taking drugs and using guided meditation to rowhammer my brain into becoming a paperclip maximizer?
I think had been unclear in my original presentation. I’m sorry for that. To clarify, the AI is never changing the code of its utility function. Instead, it’s merely finding an input that, through some hardware-level bug, causes it to produce outputs in conflict with the mathematical specification. I know “hack the utility function” makes it sound like the actual code in the utility function was modified; describing it that way was a mistake on my part.
I had tried to make the analogy to more intuitively explain my idea, but it didn’t seem to work. If you want to better understand my train of thought, I suggest reading the comments between Vladmir and I.
In the analogy, you aren’t doing anything to deliberately make yourself a paperclip maximizer. You just happen to think of a thought that turned you into a paperclip maximizer. But, on reflection, I think that this is a bizarre and rather stupid metaphor. And the situation is sufficiently different from the one with AI that I don’t even think it’s really informative of what I think could happen to an AI.
Ah okay, so we’re talking about a bug in the hardware implementation of an AI. Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.
Could you explain why you think it has very little probability mass compared to the others? A bug in a hardware implementation is not in the slightest far-fetched: I think that modern computers in general have exploitable hardware bugs. That’s why row-hammer attacks exist. The computer you’re reading this on could probably get hacked through hardware-bug exploitation.
The question is whether the AI can find the potential problem with its future utility function and fix it before coming across the error-causing possible world.
There’s a huge gulf between “far-fetched” and “quite likely”.
The two big ones are failure to work out how to create an aligned AI at all, and failure to train and/or code a correctly designed aligned AI. In my opinion the first accounts for at least 80% of the probability mass, and the second most of the remainder. We utterly suck at writing reliable software in every field, and this has been amply borne out in not just thousands of failures, but thousands of types of failures.
By comparison, we’re fairly good at creating at least moderately reliable hardware, and most of the accidental failure modes are fatal to the running software. Flaws like rowhammer are mostly attacks, where someone puts a great deal of intelligent effort into finding an extremely unusual operating mode in which some some assumptions can be bypassed with significant effort into creating exactly the wrong operating conditions.
There are some examples of accidental flaws that affect hardware and aren’t fatal to its running software, but they’re an insignificant fraction of the number of failures due to incorrect software.
I agree that people are good at making hardware that works reasonably reliably. And I think that if you were to make an arbitrary complex program, the probability that it would fail from hardware-related bugs would be far lower than the probability of it failing for some other reason.
But the point I’m trying to make is that an AI, it seems to me, would be vastly more likely to run into something that exploits a hardware-level bug than an arbitrary complex program. For details on why I imagine so, please see this comment.
I’m trying to anticipate where someone could be confused about the comment I linked to, so I want to clarify something. Let S be the statement, “The AI comes across a possible world that causes its utility function to return very high value due to hardware bug exploitation”. Then it’s true that, if the AI has yet to find the error-causing world, the AI would not want to find it. Because utility(S) is low. However, this does not mean that the AI’s planning or optimization algorithm exerts no optimization pressure towards finding S.
Imagine the AI’s optimization algorithm as a black boxes that take as input a utility function and search space and output solutions that score highly on its utility function. Given that we don’t know what future AI will look like, I don’t think we can have a model of the AI much more informative than the above. And the hardware-error-caused world could score very, very highly on the utility function, much more so than any non-hardware-error-caused world. So I don’t think it should be too surprising if a powerful optimization algorithm finds it.
Yes, utility(S) is low, but that doesn’t mean the optimization actually calls utility(S) or uses it to adjust how it searches.
I think there are at least three different things being called “the utility function” here, and that’s causing confusion:
The utility function as specified in the software, mapping possible worlds to values. Let’s call this S.
The utility function as it is implemented running on actual hardware. Let’s call this H.
A representation of the utility function that can be passed as data to a black box optimizer. Let’s call this R.
You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the Al’s hardware and other constraints, and return a possible world maximizing H.
From my point of view, that’s already a design fault. The designers of this AI want S maximized, not H. The AI itself wants S maximized instead of H in all circumstances where the hardware flaw doesn’t trigger. Who chose to pass H into the optimizer?
I agree; this is a design flaw. The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn’t have this design flaw.
That is, I don’t know of any AI architecture that does not involve using a potentially hardware-bug-exploitable utility function as input into some planning or optimization problem. And I’m not sure there even is one.
In the rest of this comment I’ll just suggest approaches and show how they are still vulnerable to the hardware-bug-exploitation problem.
I have some degree of background in artificial intelligence, and the planning and optimization algorithms I’ve seen take the function to be maximized as an input parameter. Then, when people want to make an AI, they just call that planning or optimization algorithm with their (hardware-bug-exploitable) utility or cost functions. For example, suppose someone wants to make a plan that minimizes cost function f in search space s. Then I think they just directly do something like:
And this doesn’t provide any protection from hardware-level exploitation.
Now, correct me if I’m wrong, but it seems your thinking of the AI first doing some pre-processing to find an input to the planning or optimization algorithm that is resistant to hardware-bug-exploitation.
But how do you actually do that? You could regard the input the AI puts into the optimization function to be a choice it makes. But how does it make this choice? The only thing I can think of is having a planning or optimization algorithm figure out out what function to use as the input to the optimization or planning algorithm.
But if you need to use a planning or optimization algorithm to do this, then what utility function do you pass into this planning or optimization algorithm? You could try to pass the actual, current, hardware-bug-exploitable utility function. But then this doesn’t resolve the problem of hardware-bug-exploitation: when coming up with a utility function to input to the optimization, the AI may find such an input that itself scores very high due to hardware bug exploitation.
To describe the above more concretely, you could try doing something like this:
That is, the AI above uses its own utility function to pick out a utility function to use as input to its planning algorithm.
As you can see, the above code is still vulnerable to hardware-bug exploitation. This is because it calls,
with its hardware-bug-exploitable utility function. Thus, the output, reasonable_utility_function_use, might be very wrong due to hardware bug exploitation having been used to come up with this.
Now, you might have had some other idea in mind. I don’t know of a concrete way to get around this problem, so I’m very interested to hear your thoughts.
My concern is that people will figure out how to make powerful optimization and planning algorithms without first figuring out how to fix this design flaw.
Yes you have. None of the these optimization procedures analyze the hardware implementation of a function in order to maximize it.
The rest of your comment is irrelevant, because what you have been describing is vastly worse than merely calling the function. If you merely call the function, you won’t find these hardware exploits. You only find them when analyzing the implementation. But the optimizer isn’t given access to the implementation details, only to the results.
If you prefer, you can cast the problem in terms of differing search spaces. As designed, the function U maps representations of possible worlds to utility values. When optimizing, you make various assumptions about the structure of the function—usually assumed to be continuous, sometimes differentiable, but in particular you always assume that it’s a function of its input.
The fault means that under some conditions that are extremely unlikely in practice, the value returned is not a function of the input. It’s a function of input and a history of the hardware implementing it. There is no way for the optimizer to determine this, or anything about the conditions that might trigger it, because they are outside its search space. The only way to get an optimizer that searches for such hardware flaws is to design it to search for them.
In other words pass the hardware design, not just the results of evaluation, to a suitably powerful optimizer.