So it seems like this framing of alignment removes the notion of the AI “optimizing for something” or “being goal-directed”. Do you endorse dropping that idea?
With just this general argument, I would probably not argue for AI risk—if I had to argue for it, the argument would go “we ask the AI to do something, this gets mistranslated and the AI does something else with weird consequences, maybe the weird consequences include extinction”, but it sure seems like as it starts doing the “something else” we would e.g. turn it off.
Starting point: the problem which makes AI alignment hard is not the same problem which makes AI dangerous. This is the capabilities/alignment distinction: AI with extreme capabilities is dangerous; aligning it is the hard part.
Anything with extreme capabilities is dangerous, and needs to be aligned. This applies even outside AI—e.g. we don’t want a confusing interface on a nuclear silo. Lots of optimization power is a sufficient condition for extreme capabilities, but not a necessary condition.
Here’s a plausible doom scenario without explicit optimization. Imagine an AI which is dangerous in the same way as a nuke is dangerous, but more so: it can make large irreversible changes to the world too quickly for anyone to stop it. Maybe it’s capable of designing and printing a supervirus (and engineered bio-offense is inherently easier than engineered bio-defense); maybe it’s capable of setting off all the world’s nukes simultaneously; maybe it’s capable of turning the world into grey goo.
If that AI is about as transparent as today’s AI, and does things the user wasn’t expecting about as often as today’s AI, then that’s not going to end well.
Now, there is the counterargument that this scenario would produce a fire alarm, but there’s a whole host of ways that could fail:
The AI is usually very useful, so the risks are ignored
Errors are patched rather than fixing the underlying problem
Really big errors turn out to be “easier” than small errors—i.e. high-to-low level translations are more likely to be catastrophically wrong than mildly wrong
It’s hard to check in testing whether there’s a problem, because errors are rare and/or don’t look like errors at the low-level (and it’s hard/expensive to check results at the high-level)
In the absence of optimization pressure, the AI won’t actively find corner-cases in our specification of what-we-want, so it might actually be more difficult to notice problems ahead-of-time
...
Getting back to your question:
Do you endorse dropping that idea?
I don’t endorse dropping the AI-as-optimizer idea entirely. It is definitely a sufficient condition for AI to be dangerous, and a very relevant sufficient condition. But I strongly endorse the idea that optimization is not a necessary condition for AI to be dangerous. Tool AI can be plenty dangerous if it’s capable of making large, fast, irreversible changes to the world, and the alignment problem is still hard for that sort of AI.
Tool AI can be plenty dangerous if it’s capable of making large, fast, irreversible changes to the world, and the alignment problem is still hard for that sort of AI.
I definitely agree with that characterization. I think the solutions I would look for would be quite different though: they would look more like “how do I ensure that the AI system has an undo button” and “how do I ensure that the AI system does things slowly”, similarly to how with nuclear power plants (I assume) there are (possibly redundant) mechanisms that ensure you can turn off the power plant.
Of course these solutions are also subject to the same translation problem, but it seems plausible to me that that translation problem is easier to solve, relative to solving translation in full generality.
AI-as-optimizer would suggest that even if the translation problem were solved for the particular things I mentioned, it still might not be enough, because e.g. the AI might deliberately prevent me from pressing the undo button.
You could say something like “an AI that can enact large irreversible changes might form a plan where the large irreversible change starts with disabling the undo button”, but then it sort of feels like we’re bringing back in the idea of optimization. Maybe that’s fine, we’re pretty confused about optimization anyway.
“how do I ensure that the AI system has an undo button” and “how do I ensure that the AI system does things slowly”
I don’t think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of “slow” is roughly “a human has time to double-check”, which immediately makes things very expensive.
Even if we abandon economic competitiveness, I doubt that slow+reversible makes the translation problem all that much easier (though it would make the AI at least somewhat less dangerous, I agree with that). It’s probably somewhat easier—having a few cycles of feedback seems unlikely to make the problem harder. But if e.g. we’re originally training the AI via RL, then slow+reversible basically just adds a few more feedback cycles after deployment; if millions or billions of RL cycles didn’t solve the problem, then adding a handful more at the end seems unlikely to help much (though an argument could be made that those last few are higher-quality). Also, there’s still the problem of translating a human’s high-level notion of “reversible” into a low-level notion of “reversible”.
Taking a more outside view… restrictions like “make it slow and reversible” feel like patches which don’t really address the underlying issues. In general, I’d expect the underlying issues to continue to manifest themselves in other ways when patches are applied. For instance, even with slow & reversible changes, it’s still entirely plausible that humans don’t stop something bad because they don’t understand what’s going on in enough detail—that’s a typical scenario in the “translation problem” worldview.
Zooming out even further...
I think the solutions I would look for would be quite different though...
I think what’s driving this intuition is that you’re looking for ways to make the AI not dangerous, without actually aligning it (i.e. without solving the translation problem) - mainly by limiting capabilities. I expect that such strategies, in general, will run into similar problems to those mentioned above:
Capabilities which make an AI economically valuable are often capabilities which make it dangerous. Limit capabilities for safety, and the AI won’t be economically competitive.
Choosing which capabilities are “dangerous” is itself a problem of translating what-humans-want into some other framework, and is subject to the usual problems: simple solutions will be patches which don’t address everything, there will be a long tail of complicated corner cases, etc.
I think what’s driving this intuition is that you’re looking for ways to make the AI not dangerous, without actually aligning it (i.e. without solving the translation problem) - mainly by limiting capabilities.
Yup, that is definitely the intuition.
Taking a more outside view… restrictions like “make it slow and reversible” feel like patches which don’t really address the underlying issues.
Agreed.
In general, I’d expect the underlying issues to continue to manifest themselves in other ways when patches are applied.
I mean, they continue to manifest in the normal sense, in that when you say “cure cancer”, the AI system works on a plan to kill everyone; you just now get to stop the AI system from actually running that plan.
For instance, even with slow & reversible changes, it’s still entirely plausible that humans don’t stop something bad because they don’t understand what’s going on in enough detail—that’s a typical scenario in the “translation problem” worldview.
[...]
Also, there’s still the problem of translating a human’s high-level notion of “reversible” into a low-level notion of “reversible”.
[...]
simple solutions will be patches which don’t address everything, there will be a long tail of complicated corner cases, etc.
All of this is true; I’m more arguing that slow & reversible eliminates ~95% of the problems, and so if it’s easier to do than “full” alignment, then it probably becomes the best thing to do on the margin.
I don’t think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of “slow” is roughly “a human has time to double-check”, which immediately makes things very expensive.
I’d expect we’d be able to solve this over time, e.g. first you use your AI system for simple tasks which you can check quickly, then as you start trusting that you’ve worked out the bugs for those tasks, you let the AI do them faster / without oversight, and move on to more complicated tasks, etc.
(This is a much more testing + engineering based approach; the standard argument against such an approach is that it fails in the presence of optimization.)
It certainly does mean you take a hit to economic competitiveness, I mostly think the hit is not that large and is something we could pay.
I agree with most of this reasoning. I think my main point of departure is that I expect most of the value is in the long tail, i.e. eliminating 95% of problems generates <10% or maybe even <1% of the value. I expect this both in the sense that eliminating 95% of problems unlocks only a small fraction of economic value, and in the sense that eliminating 95% of problems removes only a small fraction of risk. (For the economic value part, this is mostly based on industry experience trying to automate things.)
Optimization is indeed the standard argument for this sort of conclusion, and is a sufficient condition for eliminating 95% of problems to have little impact on risk. But again, it’s not a necessary condition—if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn’t really decreased. And that’s exactly the sort of situation I expect when viewing translation as the central problem: illusion of transparency is exactly the sort of thing which doesn’t seem like a problem 95% of the time, right up until you realize that everything was completely broken all along.
Anyway, sounds like value-in-the-tail is a central crux here.
Anyway, sounds like value-in-the-tail is a central crux here.
Seems somewhat right to me, subject to caveat below.
it’s not a necessary condition—if the remaining 5% of problems are still existentially deady and likely to come up eventually (but not often enough to be caught in testing), then risk isn’t really decreased.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you’ve translated better and now you’ve eliminated 99% of the risk, and iterating this process you get to effectively no ongoing risk. There is of course risk during the iteration, but that risk can be reasonably small.
A similar argument applies to economic competitiveness: yes, your first agent is pretty slow relative to what it could be, but you can make it faster and faster over time, so you only lose a lot of value during the first few initial phases.
(For the economic value part, this is mostly based on industry experience trying to automate things.)
I have the same intuition, and strongly agree that usually most of the value is in the long tail. The hope is mostly that you can actually keep making progress on the tail as time goes on, especially with the help of your newly built AI systems.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you’ve translated better and now you’ve eliminated 99% of the risk...
I don’t see how this ever actually gets around the chicken-and-egg problem.
An analogy: we want to translate from English to Korean. We first obtain a translation dictionary which is 95% accurate, then use it to ask our Korean-speaking friend to help out. Problem is, there’s a very important difference between very similar translations of “help me translate things”—e.g. consider the difference between “what would you say if you wanted to convey X?” and “what should I say if I want to convey X?”, when giving instructions to an AI. Both of those would produce very similar results, right up until everything went wrong. (Let me know if this analogy sounds representative of the strategies you imagine.)
If you do manage to get that first translation exactly right, and successfully ask your friend for help, then you’re good—similar to the “translate how-to-translate” strategy from the OP. And with a 95% accurate dictionary, you might even have a decent chance of getting that first translation right. But if that first translation isn’t perfect, then you need some way to find that out safely—and the 95% accurate dictionary doesn’t make that any easier.
Another way to look at it: the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further. We need some other way to get at the ground truth, in order to actually reduce the error rate. If we know how to convey what-we-want with 95% accuracy, then we need some other way to get at the ground truth of translation in order to increase that accuracy further.
Let me know if this analogy sounds representative of the strategies you imagine.
Yeah, it does. I definitely agree that this doesn’t get around the chicken-and-egg problem, and so shouldn’t be expected to succeed on the first try. It’s more like you get to keep trying this strategy over and over again until you eventually succeed, because if everything goes wrong you just unplug the AI system and start over.
the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going reduce that error any further.
I think you get “ground truth data” by trying stuff and seeing whether or not the AI system did what you wanted it to do.
(This does suggest that you wouldn’t ever be able to ask your AI system to do something completely novel without having a human along to ensure it’s what we actually meant, which seems wrong to me, but I can’t articulate why.)
I think you get “ground truth data” by trying stuff and seeing whether or not the AI system did what you wanted it to do.
That’s the sort of strategy where illusion of transparency is a big problem, from a translation point of view. The difficult cases are exactly the cases where the translation usually produces the results you expect, but then produces something completely different in some rare cases.
Another way to put it: if we’re gathering data by seeing whether the system did what we wanted, then the long tail problem works against us pretty badly. Those rare tail-cases are exactly the cases we would need to observe in order to notice problems and improve the system. We’re not going to have very many of them to work with. Ability to generalize from small data sets becomes a key capability, but then we need to translate how-to-generalize in order for the AI to generalize in the ways we want (this gets at the can’t-ask-the-AI-to-do-anything-novel problem).
(The other comment is my main response, but there’s a possibly-tangential issue here.)
In a long-tail world, if we manage to eliminate 95% of problems, then we generate maybe 10% of the value. So now we use our 10%-of-value product to refine our solution. But it seems rather optimistic to hope that a product which achieves only 10% of the value gets us all the way to a 99% solution. It seems far more likely that it gets to, say, a 96% solution. That, in turn, generates maybe 15% of the value, which in turn gets us to a 96.5% solution, and...
Point being: in the long-tail world, it’s at least plausible (and I would say more likely than not) that this iterative strategy doesn’t ever converge to a high-value solution. We get fancier and fancier refinements with decreasing marginal returns, which never come close to handling the long tail.
Now, under this argument, it’s still a fine idea to try the iterative strategy. But you wouldn’t want to bet too heavily on its success, especially without a reliable way to check whether it’s working.
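To make the arithmetic concrete, here’s a toy numerical sketch of that non-converging iteration. All the numbers are made up purely for illustration (the value curve, the 0.97 ceiling, and the 20%-of-remaining-gap step are assumptions, not derived from anything):

```python
def value(q):
    # Toy value curve: almost all value sits in the last few percent of
    # coverage, so eliminating 95% of problems yields only ~10% of the value.
    return max(0.0, (q - 0.9) / 0.1) ** 3

q = 0.95  # start from a 95% solution
for step in range(10):
    # Each refinement closes 20% of the remaining gap to an assumed
    # ceiling of 0.97 -- fancier refinements, decreasing marginal returns.
    q += 0.2 * (0.97 - q)
    print(f"step {step}: coverage={q:.4f}, value={value(q):.3f}")
# Coverage creeps toward 0.97 and value plateaus; the iteration never
# reaches the long tail past the ceiling.
```

Under these (made-up) dynamics, coverage converges to the ceiling and the value plateaus around a third of the total, which is the “never converges to a high-value solution” outcome described above.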
I don’t think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of “slow” is roughly “a human has time to double-check”, which immediately makes things very expensive.
There’s already an answer to that: you separate “fast” from “unpredictable”. The AI that does things fast is not the AI that engages in out-of-the-box thinking.
Predictable low-level behavior is not the same as predictable high-level behavior. When I write or read python code, I can have a pretty clear idea of what every line does in a low-level sense, but still sometimes be surprised by high-level behavior of the code.
We still need to translate what-humans-want into a low-level specification. “Making it predictable” at a low-level doesn’t really get us any closer to predictability at the high-level (at least in the cases which are actually difficult in the first place). “Making it predictable” at a high-level requires translating high-level “predictability” into some low-level specification, which just brings us back to the original problem: translation is hard.
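As a minimal illustration of that gap between line-by-line predictability and high-level surprise, consider Python’s well-known mutable-default-argument behavior (the function here is just a made-up example):

```python
def append_item(item, bucket=[]):
    # Each line does exactly what Python's rules say: append to the
    # list, then return it. No rule is ever violated.
    bucket.append(item)
    return bucket

first = append_item("a")
second = append_item("b")
# Every low-level step was predictable, yet the high-level behavior is
# surprising: the default list is created once and shared across calls.
print(second)  # ['a', 'b'], not ['b']
```

Nothing here deviates from the language specification; the surprise lives entirely in the gap between the human’s high-level expectation (“two independent calls”) and what the low-level rules actually entail.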
I am assuming that the AI that engages in out-of-the-box thinking is not fast, and that the conjunction of fast *and* unpredictable is the central problem.
The market will demand AI that’s faster than humans, and at least as capable of creative, unpredictable thinking. However, the same AI does not have to be both. This approach to AI safety is copied from a widespread organisational principle, where the higher levels do the abstract strategic thinking (the least predictable stuff), the middle levels do the concrete, tactical thinking, and the lowest levels do what they are told. The fastest and most fine-grained actions are at the lowest level. The higher level can only communicate with the lower levels by communicating an amended strategy or policy: they are not able to interrupt fine-grained decisions, and only hear about fine-grained actions after they have happened.
I have given an abstract description of this organising principle because there are multiple concrete examples: large businesses, militaries, and the human brain/CNS. Businesses already use fast but not very flexible systems to do things faster than humans, notably in high-frequency trading. The question is whether more advanced AIs will be responsible for fine-grained trading decisions (the all-in-one approach), or whether advanced AI will substitute for or assist business analysts and market strategists.
A standard objection to Tool AI is that having a human check all the TAI’s decisions would slow things up too much. The above architecture allows an alternative, where human checking occurs between levels. In particular, communication from the highest level to the lower ones is slow anyway. The main requisite for this approach to AI safety is a human readable communications protocol.
“Making it predictable” at a high-level requires translating high-level “predictability” into some low-level specification, which just brings us back to the original problem: translation is hard.
If you are checking your high level AI as you go along, you need a high level language that is human comprehensible.
I’m pretty sure none of this actually affects what I said: the low-level behavior still needs to produce results which are predictable to humans in order for predictability to be useful, and that’s still hard.
The problem is that making an AI predictable to a human is hard. This is true regardless of whether or not it’s doing any outside-the-box thinking. Having a human double-check the instructions given to a fast low-level AI does not make the problem any easier; the low-level AI’s behavior still has to be understood by a human in order for that to be useful.
As you say toward the end, you’d need something like a human-readable communications protocol. That brings us right back to the original problem: it’s hard to translate between humans’ high-level abstractions and low-level structure. That’s why AI is unpredictable to humans in the first place.
The rules it’s given are, presumably, at a low level themselves. (Even if that’s not the case, the rules it’s given are definitely not human-intelligible unless we’ve already solved the translation problem in full.)
The question is not whether the low-level AI will follow those rules, the question is what actually happens when something follows those rules. A python interpreter will not ever deviate from the simple rules of python, yet it still does surprising-to-a-human things all the time. The problem is accurately translating between human-intelligible structure and the rules given to the AI.
The problem is not that the AI might deviate from the given rules. The problem is that the rules don’t always mean what we want them to mean.
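A tiny example of rules meaning something other than what we wanted: suppose we translate the request “sort these scores from smallest to largest” into a call to `sorted`, forgetting that the scores are stored as strings. The interpreter follows its rules faithfully:

```python
scores = ["9", "10", "2"]
# The rule actually given: lexicographic string comparison,
# followed exactly as specified.
result = sorted(scores)
print(result)  # ['10', '2', '9'] -- not the numeric order we meant
```

No rule was broken; the rule we gave simply didn’t mean what we wanted it to mean.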
The rules it’s given are, presumably, at a low level themselves.
The rules that the low level AI runs on could be medium level. There is no point in giving it very low level rules, since its job is to fill in the details. But the point is that I am stipulating that the rules should be high level enough to be human-readable.
The question is not whether the low-level AI will follow those rules, the question is what actually happens when something follows those rules. A python interpreter will not ever deviate from the simple rules of python, yet it still does surprising-to-a-human things all the time.
But the world hasn’t ended. A python interpreter doesn’t do surprisingly intelligent things, because it is not intelligent.
The problem is not that the AI might deviate from the given rules. The problem is that the rules don’t always mean what we want them to mean.
In your framing of the problem, you create one superpowerful AI that has to be programmed perfectly, which is impossible. In my solution, you reduce the problem to more manageable chunks. My solution is already partially implemented.
But the point is that I am stipulating that the rules should be high level enough to be human-readable.
If the rules are high level enough to be human readable, then translating them into something a computer can run while still maintaining the original intent is hard. That’s basically the whole alignment problem. If an AI is doing that translation, then writing/training that AI is as hard as the whole alignment problem.
A python interpreter doesn’t do surprisingly intelligent things, because it is not intelligent.
If a system is doing large, fast, irreversible things, then it does not matter whether those things are surprisingly intelligent. If they’re surprising, then that’s sufficient for it to be a problem.
In your framing of the problem, you create one superpowerful AI that has to be programmed perfectly, which is impossible.
I’m not sure what gave you that impression, but I definitely do not intend to assume any of that.
If the rules are high level enough to be human readable, then translating them into something a computer can run while still maintaining the original intent is hard.
It’s not harder than AGI, because NL is a central part of AGI.
That’s basically the whole alignment problem.
No it isn’t. You can have systems that do what they are told without having any notion of values and preferences. The higher level systems need goals because they are defining strategy, but only the higher level ones.
If a system is doing large, fast, irreversible things, then it does not matter whether those things are surprisingly intelligent. If they’re surprising, then that’s sufficient for it to be a problem.
Yes, but that’s a problem we already have, with solutions we already have. For instance, high frequency trading systems can be shut down [automatically] if the market moves too much.
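For concreteness, a minimal sketch of that kind of automatic shutdown rule (the 5% threshold and the function shape are hypothetical, not taken from any real trading system):

```python
MAX_MOVE = 0.05  # hypothetical threshold: halt on a >5% move from reference

def should_halt(reference_price, current_price, max_move=MAX_MOVE):
    # Compare the relative price move against a threshold chosen ahead
    # of time as the failure condition to watch for.
    move = abs(current_price - reference_price) / reference_price
    return move > max_move

print(should_halt(100.0, 94.0))   # 6% move: shut the system down
print(should_halt(100.0, 102.0))  # 2% move: keep trading
```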
Yes, but that’s a problem we already have, with solutions we already have.
It is a problem we already have, but the solutions we already have are all based on the assumption that either (a) we know in advance what kind of problems can happen, or (b) the problem doesn’t kill us all in one shot. For instance, in your HFT system shutdown example, we already know that “market moves too much” is something which makes a lot of HFT systems not work very well. But how did we learn that? Either we had a prior idea of what problems could happen (implying some transparency of the system), or the problem happened at least once and we learned from that (implying it didn’t kill us the first time—see e.g. Knight Capital).
With AI, it’s the same old problem, but on hard mode (i.e. the system is very opaque) and high stakes (i.e. we don’t necessarily survive the first big mistake). That’s exactly the sort of scenario where our current solutions do not work.
It’s not harder than AGI, because NL is a central part of AGI.
NL? I’m not familiar with this acronym. Also I said it’s as hard as alignment, not as hard as AGI, in case that’s relevant.
No it isn’t. You can have systems that do what they are told without having any notion of values and preferences. The higher level systems need goals because they are defining strategy, but only the higher level ones.
I’m not even convinced that higher-level systems necessarily need goals. Pure goal-free tool AI is one possible path; the OP was written to be agnostic to such considerations.
Indeed, that’s a big part of why I say translation is the central piece of the alignment problem: it’s the piece that’s agnostic. It’s the piece that has to be there, in every scheme, under a wide range of assumptions about how the world works. Tool AI? Still needs to solve the translation problem in order to be safe and useful, even without any notion of values or preferences. Utility-maximizing AI? Needs to solve the translation problem in order to be safe and useful. Hierarchical scheme? Translation still needs to be handled somewhere in order to be safe and useful. Humans-consulting-humans or variations thereof? Full system needs to solve the translation problem in order to be safe and useful. Etc.
NL? I’m not familiar with this acronym. Also I said it’s as hard as alignment, not as hard as AGI, in case that’s relevant.
Presumably “natural language”, which often gets called NLP for “natural language processing” in AI.
I think the right response there is something like “suppose you have an AGI that can understand what a human means as well as another human does; now you still have all the difficulty of interpretation that makes law a complicated and contentious field.” It’d be nice to be able to write a Constitution and recognize it after the AI has thought about it while having adversarial pressure on how to interpret it for 300 years, for example.
Yup, that is definitely the intuition.
Agreed.
I mean, they continue to manifest in the normal sense, in that when you say “cure cancer”, the AI system works on a plan to kill everyone; you just now get to stop the AI system from actually running that plan.
All of this is true; I’m more arguing that slow & reversible eliminates ~95% of the problems, and so if it’s easier to do than “full” alignment, then it probably becomes the best thing to do on the margin.
I’d expect we’d be able to solve this over time, e.g. first you use your AI system for simple tasks which you can check quickly, then as you start trusting that you’ve worked out the bugs for those tasks, you let the AI do them faster / without oversight, and move on to more complicated tasks, etc.
(This is a much more testing + engineering based approach; the standard argument against such an approach is that it fails in the presence of optimization.)
It certainly does mean you take a hit to economic competitiveness; I mostly think the hit is not that large and is something we could pay.
I agree with most of this reasoning. I think my main point of departure is that I expect most of the value is in the long tail, i.e. eliminating 95% of problems generates <10% or maybe even <1% of the value. I expect this both in the sense that eliminating 95% of problems unlocks only a small fraction of economic value, and in the sense that eliminating 95% of problems removes only a small fraction of risk. (For the economic value part, this is mostly based on industry experience trying to automate things.)
Optimization is indeed the standard argument for this sort of conclusion, and is a sufficient condition for eliminating 95% of problems to have little impact on risk. But again, it’s not a necessary condition—if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn’t really decreased. And that’s exactly the sort of situation I expect when viewing translation as the central problem: illusion of transparency is exactly the sort of thing which doesn’t seem like a problem 95% of the time, right up until you realize that everything was completely broken all along.
Anyway, sounds like value-in-the-tail is a central crux here.
Seems somewhat right to me, subject to caveat below.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you’ve translated better and now you’ve eliminated 99% of the risk, and iterating this process you get to effectively no ongoing risk. There is of course risk during the iteration, but that risk can be reasonably small.
A similar argument applies to economic competitiveness: yes, your first agent is pretty slow relative to what it could be, but you can make it faster and faster over time, so you only lose a lot of value during the first few initial phases.
I have the same intuition, and strongly agree that usually most of the value is in the long tail. The hope is mostly that you can actually keep making progress on the tail as time goes on, especially with the help of your newly built AI systems.
I don’t see how this ever actually gets around the chicken-and-egg problem.
An analogy: we want to translate from English to Korean. We first obtain a translation dictionary which is 95% accurate, then use it to ask our Korean-speaking friend to help out. Problem is, there’s a very important difference between very similar translations of “help me translate things”—e.g. consider the difference between “what would you say if you wanted to convey X?” and “what should I say if I want to convey X?”, when giving instructions to an AI. Both of those would produce very similar results, right up until everything went wrong. (Let me know if this analogy sounds representative of the strategies you imagine.)
If you do manage to get that first translation exactly right, and successfully ask your friend for help, then you’re good—similar to the “translate how-to-translate” strategy from the OP. And with a 95% accurate dictionary, you might even have a decent chance of getting that first translation right. But if that first translation isn’t perfect, then you need some way to find that out safely—and the 95% accurate dictionary doesn’t make that any easier.
Another way to look at it: the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further. We need some other way to get at the ground truth, in order to actually reduce the error rate. If we know how to convey what-we-want with 95% accuracy, then we need some other way to get at the ground truth of translation in order to increase that accuracy further.
Yeah, it does. I definitely agree that this doesn’t get around the chicken-and-egg problem, and so shouldn’t be expected to succeed on the first try. It’s more like you get to keep trying this strategy over and over again until you eventually succeed, because if everything goes wrong you just unplug the AI system and start over.
I think you get “ground truth data” by trying stuff and seeing whether or not the AI system did what you wanted it to do.
(This does suggest that you wouldn’t ever be able to ask your AI system to do something completely novel without having a human along to ensure it’s what we actually meant, which seems wrong to me, but I can’t articulate why.)
That’s the sort of strategy where illusion of transparency is a big problem, from a translation point of view. The difficult cases are exactly the cases where the translation usually produces the results you expect, but then produce something completely different in some rare cases.
Another way to put it: if we’re gathering data by seeing whether the system did what we wanted, then the long tail problem works against us pretty badly. Those rare tail-cases are exactly the cases we would need to observe in order to notice problems and improve the system. We’re not going to have very many of them to work with. Ability to generalize from small data sets becomes a key capability, but then we need to translate how-to-generalize in order for the AI to generalize in the ways we want (this gets at the can’t-ask-the-AI-to-do-anything-novel problem).
(The other comment is my main response, but there’s a possibly-tangential issue here.)
In a long-tail world, if we manage to eliminate 95% of problems, then we generate maybe 10% of the value. So now we use our 10%-of-value product to refine our solution. But it seems rather optimistic to hope that a product which achieves only 10% of the value gets us all the way to a 99% solution. It seems far more likely that it gets to, say, a 96% solution. That, in turn, generates maybe 15% of the value, which in turn gets us to a 96.5% solution, and...
Point being: in the long-tail world, it’s at least plausible (and I would say more likely than not) that this iterative strategy doesn’t ever converge to a high-value solution. We get fancier and fancier refinements with decreasing marginal returns, which never come close to handling the long tail.
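As a toy model of that non-convergence (the numbers are purely illustrative, following the 95% → 96% → 96.5% sequence above, not real data):

```python
# Toy model: each refinement round buys about half the previous
# round's improvement in coverage (illustrative numbers only).
coverage, gain = 0.95, 0.01
for _ in range(100):
    coverage += gain
    gain *= 0.5  # diminishing marginal returns on each iteration

print(round(coverage, 3))  # 0.97: the limit falls far short of the long tail
```

Even with unlimited rounds, this geometric series converges to 97% coverage—which, in a long-tail world, still leaves most of the value (and most of the risk) on the table.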
Now, under this argument, it’s still a fine idea to try the iterative strategy. But you wouldn’t want to bet too heavily on its success, especially without a reliable way to check whether it’s working.
Yeah, this could be a way that things are. My intuition is that it wouldn’t be this way, but I don’t have any good arguments for it.
There’s already an answer to that: you separate “fast” from “unpredictable”. The AI that does things fast is not the AI that engages in out-of-the-box thinking.
Predictable low-level behavior is not the same as predictable high-level behavior. When I write or read python code, I can have a pretty clear idea of what every line does in a low-level sense, but still sometimes be surprised by high-level behavior of the code.
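A standard Python example of this: every line below does exactly what Python’s rules say it does, yet the aggregate behavior surprises many readers, because the rows all alias the same list:

```python
row = [0] * 3
grid = [row] * 3   # low-level rule: repeat the *same* list object 3 times
grid[0][0] = 1     # low-level rule: mutate that shared object

# high-level surprise: "changing one cell" changed every row
print(grid)  # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
```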
We still need to translate what-humans-want into a low-level specification. “Making it predictable” at a low-level doesn’t really get us any closer to predictability at the high-level (at least in the cases which are actually difficult in the first place). “Making it predictable” at a high-level requires translating high-level “predictability” into some low-level specification, which just brings us back to the original problem: translation is hard.
I am assuming that the AI that engages in out-of-the-box thinking is not fast, and that the conjunction of fast *and* unpredictable is the central problem.
The market will demand AI that’s faster than humans, and at least as capable of creative, unpredictable thinking.
However, the same AI does not have to be both. This approach to AI safety is copied from a widespread organisational principle, where the higher levels do the abstract strategic thinking—the least predictable stuff—the middle levels do the concrete, tactical thinking, and the lowest levels do what they are told. The fastest and most fine-grained actions are at the lowest level. The higher levels can only communicate with the lower levels by issuing an amended strategy or policy: they are not able to interrupt fine-grained decisions, and only hear about fine-grained actions after they have happened. I have given an abstract description of this organising principle because there are multiple concrete examples: large businesses, militaries, and the human brain/CNS. Businesses already use fast but not very flexible systems to do things faster than humans, notably in high-frequency trading. The question is whether more advanced AIs will be responsible for fine-grained trading decisions (the all-in-one approach), or whether advanced AI will substitute for or assist business analysts and market strategists.
A standard objection to Tool AI is that having a human check all the Tool AI’s decisions would slow things down too much. The above architecture allows an alternative, where human checking occurs between levels. In particular, communication from the highest level to the lower ones is slow anyway. The main requisite for this approach to AI safety is a human-readable communications protocol.
If you are checking your high level AI as you go along, you need a high level language that is human comprehensible.
I’m pretty sure none of this actually affects what I said: the low-level behavior still needs to produce results which are predictable to humans in order for predictability to be useful, and that’s still hard.
The problem is that making an AI predictable to a human is hard. This is true regardless of whether or not it’s doing any outside-the-box thinking. Having a human double-check the instructions given to a fast low-level AI does not make the problem any easier; the low-level AI’s behavior still has to be understood by a human in order for that to be useful.
As you say toward the end, you’d need something like a human-readable communications protocol. That brings us right back to the original problem: it’s hard to translate between humans’ high-level abstractions and low-level structure. That’s why AI is unpredictable to humans in the first place.
If you know in general that a low-level AI will follow the rules it has been given, you don’t need to keep re-checking.
The rules it’s given are, presumably, at a low level themselves. (Even if that’s not the case, the rules it’s given are definitely not human-intelligible unless we’ve already solved the translation problem in full.)
The question is not whether the low-level AI will follow those rules, the question is what actually happens when something follows those rules. A python interpreter will not ever deviate from the simple rules of python, yet it still does surprising-to-a-human things all the time. The problem is accurately translating between human-intelligible structure and the rules given to the AI.
The problem is not that the AI might deviate from the given rules. The problem is that the rules don’t always mean what we want them to mean.
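A tiny Python illustration of rules-followed-exactly versus meaning-what-we-want: the interpreter applies the rules of binary floating-point arithmetic perfectly, yet those rules diverge from our high-level notion of decimal arithmetic:

```python
# The interpreter never deviates from its rules (IEEE-754 binary
# floats); the surprise comes from the gap between those rules and
# what a human means by "0.1 + 0.2".
result = 0.1 + 0.2
print(result == 0.3)  # False: the rules were followed exactly
print(result)         # 0.30000000000000004
```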
The rules that the low level AI runs on could be medium level. There is no point in giving it very low level rules, since its job is to fill in the details. But the point is that I am stipulating that the rules should be high level enough to be human-readable.
But the world hasn’t ended. A python interpreter doesn’t do surprisingly intelligent things, because it is not intelligent.
In your framing of the problem, you create one superpowerful AI that has to be programmed perfectly, which is impossible. In my solution, you reduce the problem to more manageable chunks. My solution is already partially implemented.
If the rules are high level enough to be human readable, then translating them into something a computer can run while still maintaining the original intent is hard. That’s basically the whole alignment problem. If an AI is doing that translation, then writing/training that AI is as hard as the whole alignment problem.
If a system is doing large, fast, irreversible things, then it does not matter whether those things are surprisingly intelligent. If they’re surprising, then that’s sufficient for it to be a problem.
I’m not sure what gave you that impression, but I definitely do not intend to assume any of that.
It’s not harder than AGI, because NL is a central part of AGI.
No it isn’t. You can have systems that do what they are told without having any notion of values and preferences. The higher-level systems need goals because they are defining strategy, but only the higher-level ones.
Yes, but that’s a problem we already have, with solutions we already have. For instance, high frequency trading systems can be shut down [automatically] if the market moves too much.
It is a problem we already have, but the solutions we already have are all based on the assumption that either (a) we know in advance what kind of problems can happen, or (b) the problem doesn’t kill us all in one shot. For instance, in your HFT system shutdown example, we already know that “market moves too much” is something which makes a lot of HFT systems not work very well. But how did we learn that? Either we had a prior idea of what problems could happen (implying some transparency of the system), or the problem happened at least once and we learned from that (implying it didn’t kill us the first time—see e.g. Knight capital).
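To make assumption (a) concrete, here’s a minimal sketch (hypothetical names and threshold, not any real exchange’s rule) of the kind of shutdown check an HFT system might run; note that it only catches the one failure mode we anticipated in advance:

```python
# Hypothetical circuit breaker: halt trading if the market moves more
# than a preset limit. It encodes only the anticipated failure mode.
PRICE_MOVE_LIMIT = 0.10  # illustrative 10% threshold

def should_halt(reference_price: float, current_price: float) -> bool:
    """Return True if the price has moved beyond the preset limit."""
    move = abs(current_price - reference_price) / reference_price
    return move > PRICE_MOVE_LIMIT

print(should_halt(100.0, 95.0))  # False: a 5% move is within bounds
print(should_halt(100.0, 85.0))  # True: a 15% move trips the breaker
```

Any failure mode not written into a rule like this sails right past it—which is the point: such breakers encode only the problems we already knew about.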
With AI, it’s the same old problem, but on hard mode (i.e. the system is very opaque) and high stakes (i.e. we don’t necessarily survive the first big mistake). That’s exactly the sort of scenario where our current solutions do not work.
NL? I’m not familiar with this acronym. Also I said it’s as hard as alignment, not as hard as AGI, in case that’s relevant.
I’m not even convinced that higher-level systems necessarily need goals. Pure goal-free tool AI is one possible path; the OP was written to be agnostic to such considerations.
Indeed, that’s a big part of why I say translation is the central piece of the alignment problem: it’s the piece that’s agnostic. It’s the piece that has to be there, in every scheme, under a wide range of assumptions about how the world works. Tool AI? Still needs to solve the translation problem in order to be safe and useful, even without any notion of values or preferences. Utility-maximizing AI? Needs to solve the translation problem in order to be safe and useful. Hierarchical scheme? Translation still needs to be handled somewhere in order to be safe and useful. Humans-consulting-humans or variations thereof? Full system needs to solve the translation problem in order to be safe and useful. Etc.
Presumably “natural language”, which often gets called NLP for “natural language processing” in AI.
I think the right response there is something like “suppose you have an AGI that can understand what a human means as well as another human does; now you still have all the difficulty of interpretation that makes law a complicated and contentious field.” It’d be nice to be able to write a Constitution and recognize it after the AI has thought about it while having adversarial pressure on how to interpret it for 300 years, for example.