I think what’s driving this intuition is that you’re looking for ways to make the AI not dangerous, without actually aligning it (i.e. without solving the translation problem) - mainly by limiting capabilities.
Yup, that is definitely the intuition.
Taking a more outside view… restrictions like “make it slow and reversible” feel like patches which don’t really address the underlying issues.
Agreed.
In general, I’d expect the underlying issues to continue to manifest themselves in other ways when patches are applied.
I mean, they continue to manifest in the normal sense, in that when you say “cure cancer”, the AI system works on a plan to kill everyone; you just now get to stop the AI system from actually running that plan.
For instance, even with slow & reversible changes, it’s still entirely plausible that humans don’t stop something bad because they don’t understand what’s going on in enough detail—that’s a typical scenario in the “translation problem” worldview.
[...]
Also, there’s still the problem of translating a human’s high-level notion of “reversible” into a low-level notion of “reversible”.
[...]
simple solutions will be patches which don’t address everything, there will be a long tail of complicated corner cases, etc.
All of this is true; I’m more arguing that slow & reversible eliminates ~95% of the problems, and so if it’s easier to do than “full” alignment, then it probably becomes the best thing to do on the margin.
I don’t think this is realistic if we want an economically-competitive AI. There are just too many real-world applications where we want things to happen which are fast and/or irreversible. In particular, the relevant notion of “slow” is roughly “a human has time to double-check”, which immediately makes things very expensive.
I’d expect we’d be able to solve this over time, e.g. first you use your AI system for simple tasks which you can check quickly, then as you start trusting that you’ve worked out the bugs for those tasks, you let the AI do them faster / without oversight, and move on to more complicated tasks, etc.
(This is a much more testing + engineering based approach; the standard argument against such an approach is that it fails in the presence of optimization.)
It certainly does mean you take a hit to economic competitiveness, I mostly think the hit is not that large and is something we could pay.
I agree with most of this reasoning. I think my main point of departure is that I expect most of the value is in the long tail, i.e. eliminating 95% of problems generates <10% or maybe even <1% of the value. I expect this both in the sense that eliminating 95% of problems unlocks only a small fraction of economic value, and in the sense that eliminating 95% of problems removes only a small fraction of risk. (For the economic value part, this is mostly based on industry experience trying to automate things.)
Optimization is indeed the standard argument for this sort of conclusion, and is a sufficient condition for eliminating 95% of problems to have little impact on risk. But again, it’s not a necessary condition—if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn’t really decreased. And that’s exactly the sort of situation I expect when viewing translation as the central problem: illusion of transparency is exactly the sort of thing which doesn’t seem like a problem 95% of the time, right up until you realize that everything was completely broken all along.
Anyway, sounds like value-in-the-tail is a central crux here.
Anyway, sounds like value-in-the-tail is a central crux here.
Seems somewhat right to me, subject to caveat below.
it’s not a necessary condition—if the remaining 5% of problems are still existentially deadly and likely to come up eventually (but not often enough to be caught in testing), then risk isn’t really decreased.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you’ve translated better and now you’ve eliminated 99% of the risk, and iterating this process you get to effectively no ongoing risk. There is of course risk during the iteration, but that risk can be reasonably small.
A similar argument applies to economic competitiveness: yes, your first agent is pretty slow relative to what it could be, but you can make it faster and faster over time, so you only lose a lot of value during the first few phases.
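The iteration described above can be sketched as a geometric process: if each round of “use the current system to translate better” removes a fixed fraction of the *remaining* risk, residual risk shrinks toward zero. The numbers and the fixed-fraction assumption below are purely illustrative, not claims from the discussion.

```python
# Toy model of the optimistic iteration: each round, the improved system
# eliminates a fixed fraction of whatever risk remains. All constants here
# are illustrative assumptions.

def iterate_risk(initial_risk=0.05, reduction_per_round=0.8, rounds=10):
    """Residual risk if each round removes `reduction_per_round` of what's left."""
    risk = initial_risk
    trajectory = [risk]
    for _ in range(rounds):
        risk *= (1 - reduction_per_round)
        trajectory.append(risk)
    return trajectory

traj = iterate_risk()
# Risk shrinks geometrically: 5% -> 1% -> 0.2% -> ... -> effectively zero.
```

Under this assumption the 95%-then-99% pattern in the comment falls out directly (residual 5% → 1%); the whole disagreement below is about whether each round really removes a *fixed fraction* of what remains.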
(For the economic value part, this is mostly based on industry experience trying to automate things.)
I have the same intuition, and strongly agree that usually most of the value is in the long tail. The hope is mostly that you can actually keep making progress on the tail as time goes on, especially with the help of your newly built AI systems.
An important part of my intuition about value-in-the-tail is that if your first solution can knock off 95% of the risk, you can then use the resulting AI system to design a new AI system where you’ve translated better and now you’ve eliminated 99% of the risk...
I don’t see how this ever actually gets around the chicken-and-egg problem.
An analogy: we want to translate from English to Korean. We first obtain a translation dictionary which is 95% accurate, then use it to ask our Korean-speaking friend to help out. Problem is, there’s a very important difference between very similar translations of “help me translate things”—e.g. consider the difference between “what would you say if you wanted to convey X?” and “what should I say if I want to convey X?”, when giving instructions to an AI. Both of those would produce very similar results, right up until everything went wrong. (Let me know if this analogy sounds representative of the strategies you imagine.)
If you do manage to get that first translation exactly right, and successfully ask your friend for help, then you’re good—similar to the “translate how-to-translate” strategy from the OP. And with a 95% accurate dictionary, you might even have a decent chance of getting that first translation right. But if that first translation isn’t perfect, then you need some way to find that out safely—and the 95% accurate dictionary doesn’t make that any easier.
Another way to look at it: the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further. We need some other way to get at the ground truth, in order to actually reduce the error rate. If we know how to convey what-we-want with 95% accuracy, then we need some other way to get at the ground truth of translation in order to increase that accuracy further.
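The error-floor point has a familiar statistical analogue: re-processing a fixed sample, however cleverly, cannot beat the sampling error of that sample; only new independent data can. The sketch below is a toy illustration under that framing, with all numbers chosen arbitrarily.

```python
# Toy illustration: clever re-processing of a fixed sample (here, bootstrap
# resampling) is still a function of the same data, so it can't get closer
# to the ground truth than the original estimate. All constants are
# illustrative assumptions.
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# 100 draws from a distribution whose true mean is 0.0
sample = [random.gauss(0.0, 1.0) for _ in range(100)]
est = sum(sample) / len(sample)

# Averaging many bootstrap resamples just recovers `est` rather than
# getting any closer to the true mean of 0.0:
boot = [sum(random.choices(sample, k=100)) / 100 for _ in range(2000)]
boot_mean = sum(boot) / len(boot)

# A genuinely better estimate requires *new* independent data:
bigger_sample = sample + [random.gauss(0.0, 1.0) for _ in range(900)]
better_est = sum(bigger_sample) / len(bigger_sample)
```

The analogy to the translation case: a 95%-accurate dictionary is the fixed sample, and no amount of clever use of it supplies the missing ground truth.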
Let me know if this analogy sounds representative of the strategies you imagine.
Yeah, it does. I definitely agree that this doesn’t get around the chicken-and-egg problem, and so shouldn’t be expected to succeed on the first try. It’s more like you get to keep trying this strategy over and over again until you eventually succeed, because if everything goes wrong you just unplug the AI system and start over.
the chicken-and-egg problem is a ground truth problem. If we have enough data to estimate X to within 5%, then doing clever things with that data is not going to reduce that error any further.
I think you get “ground truth data” by trying stuff and seeing whether or not the AI system did what you wanted it to do.
(This does suggest that you wouldn’t ever be able to ask your AI system to do something completely novel without having a human along to ensure it’s what we actually meant, which seems wrong to me, but I can’t articulate why.)
I think you get “ground truth data” by trying stuff and seeing whether or not the AI system did what you wanted it to do.
That’s the sort of strategy where illusion of transparency is a big problem, from a translation point of view. The difficult cases are exactly the cases where the translation usually produces the results you expect, but then produces something completely different in some rare cases.
Another way to put it: if we’re gathering data by seeing whether the system did what we wanted, then the long tail problem works against us pretty badly. Those rare tail-cases are exactly the cases we would need to observe in order to notice problems and improve the system. We’re not going to have very many of them to work with. Ability to generalize from small data sets becomes a key capability, but then we need to translate how-to-generalize in order for the AI to generalize in the ways we want (this gets at the can’t-ask-the-AI-to-do-anything-novel problem).
(The other comment is my main response, but there’s a possibly-tangential issue here.)
In a long-tail world, if we manage to eliminate 95% of problems, then we generate maybe 10% of the value. So now we use our 10%-of-value product to refine our solution. But it seems rather optimistic to hope that a product which achieves only 10% of the value gets us all the way to a 99% solution. It seems far more likely that it gets to, say, a 96% solution. That, in turn, generates maybe 15% of the value, which in turn gets us to a 96.5% solution, and...
Point being: in the long-tail world, it’s at least plausible (and I would say more likely than not) that this iterative strategy doesn’t ever converge to a high-value solution. We get fancier and fancier refinements with decreasing marginal returns, which never come close to handling the long tail.
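The non-convergent sequence sketched above (95% → 96% → 96.5% → …) can be made concrete: if each refinement closes half as much of the gap as the previous one, coverage converges to a fixed ceiling and never reaches the tail. The halving rate is an illustrative assumption, chosen only to match the numbers in the example.

```python
# Toy model of the pessimistic dynamic: each refinement gains half as much
# coverage as the previous one (1%, 0.5%, 0.25%, ...), so total coverage
# converges to ~97% and never approaches the long tail. The halving rate
# is an illustrative assumption.

def iterate_coverage(start=0.95, first_gain=0.01, rounds=50):
    coverage, gain = start, first_gain
    for _ in range(rounds):
        coverage += gain
        gain /= 2  # decreasing marginal returns on each refinement
    return coverage

final = iterate_coverage()
# Geometric series: 0.95 + 0.01 * (1 + 1/2 + 1/4 + ...) -> 0.97,
# well short of handling the tail.
```

Contrast this with the fixed-fraction-of-remaining-risk model from earlier in the thread: which functional form actually holds is exactly the value-in-the-tail crux.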
Now, under this argument, it’s still a fine idea to try the iterative strategy. But you wouldn’t want to bet too heavily on its success, especially without a reliable way to check whether it’s working.
Yeah, this could be a way that things are. My intuition is that it wouldn’t be this way, but I don’t have any good arguments for it.