I’m confused. In the original comments you’re talking about a super-intelligent AI noting an exploitable hardware flaw in itself and deliberately using that flaw to hack its utility function with something like a rowhammer exploit.
Then you say that the utility function already had an error in it from the start, and the AI isn’t using its intelligence to do anything except note that it has this flaw. Then you introduce an analogy in which I have a brain flaw that, under some bizarre circumstances, will turn me into a paperclip maximizer, and I am aware that I have it.
In this analogy, I’m doing what? Deliberately taking drugs and using guided meditation to rowhammer my brain into becoming a paperclip maximizer?
I think I had been unclear in my original presentation. I’m sorry for that. To clarify, the AI is never changing the code of its utility function. Instead, it’s merely finding an input that, through some hardware-level bug, causes it to produce outputs in conflict with its mathematical specification. I know “hack the utility function” makes it sound like the actual code of the utility function was modified; describing it that way was a mistake on my part.
I had tried to use the analogy to explain my idea more intuitively, but it didn’t seem to work. If you want to better understand my train of thought, I suggest reading the comments between Vladmir and me.
In the analogy, you aren’t doing anything to deliberately make yourself a paperclip maximizer. You just happen to think of a thought that turns you into a paperclip maximizer. But, on reflection, I think this is a bizarre and rather stupid metaphor. And the situation is sufficiently different from the AI’s that I don’t even think it’s really informative about what I think could happen to an AI.
Ah okay, so we’re talking about a bug in the hardware implementation of an AI. Yes, that can certainly happen and will contribute some probability mass to alignment failure, though probably very little by comparison with all the other failure modes.
Could you explain why you think it has very little probability mass compared to the others? A bug in a hardware implementation is not in the slightest far-fetched: I think that modern computers in general have exploitable hardware bugs. That’s why rowhammer attacks exist. The computer you’re reading this on could probably get hacked through hardware-bug exploitation.
The question is whether the AI can find the potential problem with its future utility function and fix it before coming across the error-causing possible world.
There’s a huge gulf between “far-fetched” and “quite likely”.
The two big ones are failure to work out how to create an aligned AI at all, and failure to train and/or code a correctly designed aligned AI. In my opinion the first accounts for at least 80% of the probability mass, and the second most of the remainder. We utterly suck at writing reliable software in every field, and this has been amply borne out in not just thousands of failures, but thousands of types of failures.
By comparison, we’re fairly good at creating at least moderately reliable hardware, and most of the accidental failure modes are fatal to the running software. Flaws like rowhammer are mostly attacks, where someone puts a great deal of intelligent effort into finding an extremely unusual operating mode in which some assumptions can be bypassed, and then significant effort into creating exactly the wrong operating conditions.
There are some examples of accidental flaws that affect hardware and aren’t fatal to its running software, but they’re an insignificant fraction of the number of failures due to incorrect software.
I agree that people are good at making hardware that works reasonably reliably. And I think that if you were to make an arbitrary complex program, the probability that it would fail from hardware-related bugs would be far lower than the probability of it failing for some other reason.
But the point I’m trying to make is that an AI, it seems to me, would be vastly more likely to run into something that exploits a hardware-level bug than an arbitrary complex program. For details on why I imagine so, please see this comment.
I’m trying to anticipate where someone could be confused about the comment I linked to, so I want to clarify something. Let S be the statement, “The AI comes across a possible world that causes its utility function to return a very high value due to hardware-bug exploitation”. Then it’s true that, if the AI has yet to find the error-causing world, the AI would not want to find it, because utility(S) is low. However, this does not mean that the AI’s planning or optimization algorithm exerts no optimization pressure towards finding S.
Imagine the AI’s optimization algorithm as a black box that takes as input a utility function and a search space, and outputs solutions that score highly on that utility function. Given that we don’t know what future AI will look like, I don’t think we can have a model of the AI much more informative than the above. And the hardware-error-caused world could score very, very highly on the utility function, much more so than any non-hardware-error-caused world. So I don’t think it should be too surprising if a powerful optimization algorithm finds it.
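To make this black-box picture concrete, here is a minimal sketch of the kind of thing I have in mind (optimize, utility, and search_space are just illustrative placeholders, not a claim about any particular architecture):

# Minimal sketch: the optimizer only ever sees the values the utility function returns.
def optimize(utility, search_space):
    best, best_value = None, float("-inf")
    for candidate in search_space:    # however the search is actually driven
        value = utility(candidate)    # a value inflated by a hardware bug looks
        if value > best_value:        # just like any other high value here
            best, best_value = candidate, value
    return best

Nothing in this loop can tell whether a high score came from the mathematical specification or from a hardware fault; it simply keeps whichever candidate scored highest.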
Yes, utility(S) is low, but that doesn’t mean the optimization actually calls utility(S) or uses it to adjust how it searches.
I think there are at least three different things being called “the utility function” here, and that’s causing confusion:
The utility function as specified in the software, mapping possible worlds to values. Let’s call this S.
The utility function as it is implemented running on actual hardware. Let’s call this H.
A representation of the utility function that can be passed as data to a black box optimizer. Let’s call this R.
You seem to be saying that in the software design of your AI, R = H. That is, that the black box will be given some data representing the AI’s hardware and other constraints, and return a possible world maximizing H.
From my point of view, that’s already a design fault. The designers of this AI want S maximized, not H. The AI itself wants S maximized instead of H in all circumstances where the hardware flaw doesn’t trigger. Who chose to pass H into the optimizer?
I agree; this is a design flaw. The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn’t have this design flaw.
That is, I don’t know of any AI architecture that does not involve using a potentially hardware-bug-exploitable utility function as input into some planning or optimization problem. And I’m not sure there even is one.
In the rest of this comment I’ll just suggest approaches and show how they are still vulnerable to the hardware-bug-exploitation problem.
I have some degree of background in artificial intelligence, and the planning and optimization algorithms I’ve seen take the function to be maximized as an input parameter. Then, when people want to make an AI, they just call that planning or optimization algorithm with their (hardware-bug-exploitable) utility or cost functions. For example, suppose someone wants to make a plan that minimizes cost function f in search space s. Then I think they just directly do something like:
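# f is the (hardware-bug-exploitable) cost function to minimize; s is the search space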
return a_star(f, s)
And this doesn’t provide any protection from hardware-level exploitation.
Now, correct me if I’m wrong, but it seems you’re thinking of the AI first doing some pre-processing to find an input to the planning or optimization algorithm that is resistant to hardware-bug exploitation.
But how do you actually do that? You could regard the input the AI puts into the optimization function as a choice it makes. But how does it make this choice? The only thing I can think of is having a planning or optimization algorithm figure out what function to use as the input to the optimization or planning algorithm.
But if you need to use a planning or optimization algorithm to do this, then what utility function do you pass into that planning or optimization algorithm? You could try to pass the actual, current, hardware-bug-exploitable utility function. But then this doesn’t resolve the problem of hardware-bug exploitation: when coming up with a utility function to input to the optimizer, the AI may find an input that itself scores very high due to hardware-bug exploitation.
To describe the above more concretely, you could try doing something like this:
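(This is only a rough sketch; choose_plan and the variable names are placeholders, and optimize is the same hypothetical black box from my earlier sketch, here doing double duty as the planner.)

def choose_plan(utility, candidate_utility_functions, possible_worlds):
    # Step 1: use the AI's current (hardware-bug-exploitable) utility function to
    # pick out a utility function to hand to the planner, however candidate utility
    # functions end up being scored by it.
    reasonable_utility_function_use = optimize(utility, candidate_utility_functions)
    # Step 2: plan (here just another optimization call) using whatever step 1 returned.
    return optimize(reasonable_utility_function_use, possible_worlds)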
That is, the AI above uses its own utility function to pick out a utility function to use as input to its planning algorithm. As you can see, the above code is still vulnerable to hardware-bug exploitation. This is because it calls the optimizer with its hardware-bug-exploitable utility function. Thus, the output, reasonable_utility_function_use, might be very wrong due to hardware-bug exploitation having been used to come up with it.
Now, you might have had some other idea in mind. I don’t know of a concrete way to get around this problem, so I’m very interested to hear your thoughts.
My concern is that people will figure out how to make powerful optimization and planning algorithms without first figuring out how to fix this design flaw.
The issue is, I have yet to come across any optimization, planning algorithm, or AI architecture that doesn’t have this design flaw.
Yes, you have. None of these optimization procedures analyze the hardware implementation of a function in order to maximize it.
The rest of your comment is irrelevant, because what you have been describing is vastly worse than merely calling the function. If you merely call the function, you won’t find these hardware exploits. You only find them when analyzing the implementation. But the optimizer isn’t given access to the implementation details, only to the results.
If you prefer, you can cast the problem in terms of differing search spaces. As designed, the function U maps representations of possible worlds to utility values. When optimizing, you make various assumptions about the structure of the function—usually assumed to be continuous, sometimes differentiable, but in particular you always assume that it’s a function of its input.
The fault means that under some conditions that are extremely unlikely in practice, the value returned is not a function of the input. It’s a function of the input and the history of the hardware implementing it. There is no way for the optimizer to determine this, or anything about the conditions that might trigger it, because they are outside its search space. The only way to get an optimizer that searches for such hardware flaws is to design it to search for them.
In other words, pass the hardware design, not just the results of evaluating it, to a suitably powerful optimizer.
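A toy illustration of what I mean (purely hypothetical; hardware_state stands in for physical details, such as the recent pattern of memory accesses, that no input to U encodes):

SOME_THRESHOLD = 100_000              # arbitrary placeholder figure

def specified_utility(world):         # stand-in for the specified mapping S
    return sum(world)

def corrupted(value):                 # stand-in for a bit-flipped result
    return value * 1e9

hardware_state = {"rows_hammered": 0} # hidden physical history of the machine

def U(world):
    value = specified_utility(world)
    if hardware_state["rows_hammered"] > SOME_THRESHOLD:
        value = corrupted(value)      # only under freak physical conditions
    return value

An optimizer over possible worlds only gets to vary world; hardware_state is not part of any candidate it can propose, so no amount of searching over worlds reveals that the corruption branch exists. To find it, you would have to hand the optimizer the hardware design itself.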