I recently posted a question asking whether iterated amplification is actually more powerful than mere mimicry, and arguing that it is not. I thought I was making a fairly significant point, but the post attracted very little attention. I’m not saying this is a bad thing, but I’m not really sure why it happened, so I would appreciate some insight into how I can contribute more usefully.
Iterated amplification seems to be the leading proposal for creating aligned AI, so I thought a post arguing against it, if correct, would be a useful contribution. Perhaps there is some mistake in my reasoning, but I have yet to see one pointed out. It’s possible that people have already thought of this consideration and posted about it, but I haven’t found any such post, so I’m not really sure.
Would it have been better to write it as an actual post instead of framing it as a question? I have more to say in favor of mimicry than I mentioned in the question; would it be worthwhile to add that material and post it as a non-question post?
It’s true that most problems could be delegated to uploads, and any specific design is a design that the uploads could come up with just as well or better. The issue is that we don’t have uploads, and most plans to get them before AGI involve the kind of hypothetical AI know-how that might easily be used to build an agentic AGI, the risk the uploads are supposed to resolve.
Thus the “humans” of a realistic implementation of HCH are expected to be vague imitations of humans that only function somewhat sensibly in familiar situations and for a short time, not fully functional uploads, and most of the point of the specific designs is to mitigate the imperfection of their initial form, to make something safe/useful out of this plausibly feasible ingredient. One of the contentious points about this is whether it’s actually possible to build something useful (let alone safe) out of such imperfect imitations, even if we build a large system out of them that uses an implausible amount of resources. This is what happens with an HCH that can use an infinite number of actual uploads (exact imitations) that are still restricted to an hour or a day of thinking/learning (and then essentially get erased, that is, they can’t make further use of the things they learned). Designing something safe/useful in the exact-imitation HCH setting is an easier problem than doing so in a realistic setting, so it’s a good starting point.
Thanks for the response. To be clear, when discussing mimics, I did not have in mind perfect uploads of people. Instead, they could indeed be rather limited imitations. For example, an AI designing improvements to itself doesn’t need to actually have a generally faithful imitation of human behavior. Instead, it could just know a few things, like, “make this algorithm score better on this thing without taking over the world”.
Still, I can see how iterated amplification could be valuable when it comes to especially limited imitations. This seems especially true if the imitations are unreliable even in narrow situations. It would be problematic if an AI tasked with designing powerful AI didn’t get the “act corrigibly, and don’t take over the world” part reliably right.
I’ve been thinking about what you’ve said about iterated amplification, and there are some things I’m unsure of. I’m still rather skeptical of the benefit of iterated amplification, so I’d really appreciate a response.
You mentioned that iterated amplification can be useful when you have only very limited, domain-specific models of human behavior, where such models on their own would be unable to, say, write code. However, there are two things I’m wondering about. The first is that, for a wide range of situations, it seems to me that you need a general and robustly accurate model of human behavior to perform well. The second is that, even if you don’t have a general model of human behavior, it seems to me that a single amplification step is sufficient, which I suppose isn’t iterated amplification. The big benefit of avoiding iteration is that iterated amplification compounds errors on each distillation step, producing an exponential decrease in reliability, whereas with a single amplification step this exponential decrease wouldn’t occur.
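To make that last point concrete, here’s some toy arithmetic (purely illustrative, and it assumes each distillation step independently preserves a fixed fraction of reliability, which is just my stated worry rather than an established property of any particular scheme):

```python
# Toy calculation: if each distillation step independently preserves a fraction
# r of reliability, then n iterated steps leave r**n, while a single
# amplification step (one distillation) leaves just r. Numbers are illustrative.
r = 0.99
for n in (1, 5, 20, 50):
    print(f"{n:2d} distillation steps: reliability ~ {r**n:.3f}")
```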
For the first topic, suppose your AI is trained to make movies. I think just about every human value is relevant to the creation of movies, because humans usually like movies with a happy ending, and to make an ending happy you need to understand what humans consider a “happy ending”.
Further, you would need an accurate model of human cognitive capabilities. To make a good movie, it needs to be easy enough for humans to understand. But sometimes it also shouldn’t be too easy, because that can remove the mystery of it.
And the above is not just true for movies: I think creating other forms of entertainment would involve the same things as above.
Could you do the above with only some domain-limited model of what counts as confusing, or as a good or bad ending, in the context of movies? It’s not clear to me that this is possible. Movies involve a very wide variety of situations, and you need to keep things understandable and ending happily in all of those circumstances. I don’t see how you could robustly do that without a general model of what people find confusing or otherwise bad.
Further, whenever an AI needs to explain something to humans, it seems important that it has an accurate model of what humans can and cannot understand. Is there any way to do this with purely domain-specific models rather than with a general understanding of what people find confusing? It’s not clear to me that there is. For example, imagine an AI that needs to explain many different things; maybe it’s tasked with creating learning materials or producing the news. With such a broad category of things to explain, it’s really not clear to me how an AI could do this without a general model of what makes things confusing or not.
Also more generally, it seems to me that whenever the AI is involved with human interaction in novel circumstances, it will need an accurate model of what people like and dislike. For example, consider an AI tasked with coming up with a plan for human workers. Doing so has the potential to involve an extremely wide range of values. For example, humans generally value novelty, autonomy, not feeling embarrassed, not being bored, not being overly pressured, not feeling offended, and not seeing disgusting or ugly things.
Could you have an AI learn to avoid these things with only domain-specific models, rather than a general understanding of what people value and disvalue? I’m not sure how. Maybe you could learn models that reflect people’s values in limited circumstances. However, I think an essential component of intelligence is coming up with novel plans involving novel situations, and I don’t see how an agent could do this without a general understanding of values. For example, the AI might create entire new industries, and it would be important that any human workers in those industries have satisfactory conditions.
Now, for the second topic: using amplification without iteration.
First off, I want to note that, even without a general model of humans, it’s still not clear to me that you need any amplification at all. As I’ve said before, even mere human imitation has the potential to result in extremely high intelligence simply by doing the same things humans do, but much faster. As I mentioned previously, suppose the human output to be mimicked is the published research papers of top researchers. Then the AI could use those papers to produce future papers the way the researchers would, but far, far faster.
But suppose you do still need amplification. Then I don’t see why one amplification step wouldn’t be enough. I think that if you put together a sufficiently large number of intelligent humans and give them unlimited time to think, they’d be able to solve pretty much anything that iterated amplification with HCH would be able to solve. So, instead of having multiple amplification and distillation steps, you could instead have one very large amplification step involving a large enough number of human models interacting that it could solve pretty much anything.
If the amplification step involves a sufficiently large number of people, you might be concerned that it would be intractable to emulate them all.
I’m not sure this would be a problem. Consider again the AI designed to mimic the research papers of top researchers. I think a small number of top researchers are often responsible for a large proportion of research progress, so the AI could potentially just predict what the output of the top, say, 100 or 1,000 researchers working together would be. And the AI could potentially produce each researcher’s output with far less computation than the researcher would need. That sounds plausibly like enough to me.
But suppose that’s not enough, and emulating every human individually during the amplification step is intractable. Here’s how I think you could get around this: train not only a human model, but also a system for approximating the output of an expensive computation at much lower computational cost. Then, for the amplification step, you can define a computation involving an extremely large number of interacting emulated humans and let the approximation system approximate its output without directly emulating every human.
To give a sense of how this might work, note that in a computation, often a small number of the parts account for most of the output. For example, if you are approximating a computation about gravity, commonly only the closest, most massive objects have a significant gravitational effect on a given body, and you can ignore the rest. Similarly, rather than simulate individual atoms, it’s much more efficient to group large numbers of atoms together and consider their effect collectively. The same is true of other computations involving many small components.
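As a purely illustrative sketch of the gravity case (the cutoff rule and all the numbers here are just things I made up for the example), dropping negligible contributors might look like this:

```python
import numpy as np

# Purely illustrative: approximate the net gravitational acceleration on a point
# by keeping only the bodies whose individual pull is non-negligible.
G = 6.674e-11

def exact_accel(point, masses, positions):
    diffs = positions - point
    dists = np.linalg.norm(diffs, axis=1)
    return np.sum((G * masses / dists**3)[:, None] * diffs, axis=0)

def approx_accel(point, masses, positions, keep_fraction=0.01):
    diffs = positions - point
    dists = np.linalg.norm(diffs, axis=1)
    pull = G * masses / dists**2                  # strength of each body's pull
    keep = pull >= keep_fraction * pull.max()     # drop negligible contributors
    return np.sum((G * masses[keep] / dists[keep]**3)[:, None] * diffs[keep], axis=0)

rng = np.random.default_rng(0)
masses = rng.uniform(1e20, 1e26, size=1000)
positions = rng.uniform(-1e9, 1e9, size=(1000, 3))
point = np.zeros(3)

exact = exact_accel(point, masses, positions)
approx = approx_accel(point, masses, positions)
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```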
To emulate humans, you could potentially do the same thing as when simulating gravity. Specifically, an AI may be able to consider groups of humans and infer what the final output of each group will be, without needing to emulate each member individually. Further, for very challenging topics, many people may fail to contribute anything to the final result, so the AI could potentially avoid emulating them at all.
So I still can’t really see the benefit of iterated amplification. Of course, I could be missing something, so I’m interested in hearing what you think.
One potential problem is that it might be hard to come up with good training data for an arbitrary-function approximator, since finding the exact output of expensive functions would be expensive. However, it’s not clear to me how big a problem this would be. As I’ve said before, even the output of 100 or 1,000 humans interacting could potentially be all the AI ever needs, and with sufficiently fast approximations of individual humans, creating training data for this could be tractable.
Further, I bet the AI could learn a lot about arbitrary-function approximation just by training on approximations of functions that are already reasonably fast to compute. I think the basic techniques for quickly approximating functions are the ones I mentioned before: come up with abstract objects that group individual components, and know when to stop computing with a given object because it’s clear it will have little effect on the final result.
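To sketch just the data-generation part of this idea (the particular function and the polynomial surrogate are made up for the example, and this is only the simplest piece, not a real proposal): because the function is cheap, exact training targets are affordable, and a cheap surrogate can be fit and then checked against them.

```python
import numpy as np

def exact_f(x):
    # Cheap-to-compute "ground truth"; in the real idea this would stand in for
    # an expensive computation, but here it's cheap so exact labels are affordable.
    return np.sin(x) + 0.1 * x**2

rng = np.random.default_rng(0)
xs = rng.uniform(-3, 3, size=200)
ys = exact_f(xs)                         # exact training targets

coeffs = np.polyfit(xs, ys, deg=5)       # fit a cheap polynomial surrogate
surrogate = np.poly1d(coeffs)

test = rng.uniform(-3, 3, size=50)
print("max surrogate error:", np.max(np.abs(surrogate(test) - exact_f(test))))
```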
Amplification induces a dynamic in the model space; it’s a concept of improving models (or, equivalently in this context, distributions). This can be useful in various ways when you don’t have good datasets.
For robustness, you have a dataset that’s drawn from the wrong distribution, and you need to act the way you would have acted if it had been drawn from the correct distribution. If you have an amplification dynamic that moves models towards a few attractors, then changing the starting point (training distribution compared to target distribution) probably won’t matter. At that point the issue is for the attractor to be useful with respect to all those starting distributions/models. This doesn’t automatically make sense; comparing models by usefulness doesn’t fall out of the other concepts.
For chess, you’d use the idea of winning games (better models are those that win more, thus amplification should move models towards winning), which is not inherent in any dataset of moves. For AGI, this is much more nebulous, but things like reflection (thinking about a problem longer, conferring with others, etc.) seem like a possible way of bootstrapping a relevant amplification, if goodharting is kept in check throughout the process.
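As a toy illustration of the attractor point (not any real training setup; summarizing a model by a single accuracy number, the majority-of-three “amplification”, and the noise term are all invented for the example), different starting models get pulled to the same place:

```python
# Toy dynamic, invented for illustration only: "amplify" a model whose quality
# is summarized by a single number p (probability of answering correctly) via
# a majority-of-three vote, then "distill" with a small loss. Different starting
# points end up at the same attractor.
def amplify_then_distill(p):
    amplified = 3 * p**2 - 2 * p**3      # accuracy of a majority-of-three vote
    return 0.98 * amplified + 0.01       # imperfect distillation back into a model

for start in (0.55, 0.7, 0.9):
    p = start
    for _ in range(40):
        p = amplify_then_distill(p)
    print(f"start {start:.2f} -> attractor ~ {p:.3f}")
```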
Interesting. Do you have any links discussing this? I read Paul Christiano’s post on reliability amplification, but couldn’t find mention of this. And, alas, I’m having trouble finding other relevant articles online.
Yes, that’s true. I’m not claiming that iterated amplification doesn’t have advantages. What I’m wondering is whether non-iterated amplification is a viable alternative; I haven’t seen it proposed before for creating aligned AI. Amplification without iteration has the disadvantage that it may not have the attractor dynamic iterated amplification has, but it also doesn’t have the exponentially increasing unreliability. So it’s not clear to me whether pursuing iterated amplification is a more promising strategy than amplification without iteration.
For me, the interesting thing about IDA is not capability amplification like self-play, but an attitude towards generation of datasets as a point of intervention into the workings of an AI for all kinds of improvements. So we have some AI that we want to make better in some respect, and the IDA methodology says that to do that, we should employ the AI to generate a dataset for retraining a new version of it that’s better than the original dataset in that respect. Then we retrain the AI using the new dataset. So amplification unpackages the AI into the form of an appropriately influenced dataset, and then learning repackages it for further use.
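A minimal toy sketch of that loop (the Fibonacci task, the dict-as-model, and all the names are invented for illustration; real distillation would train a model on the generated dataset rather than memorize it):

```python
# Toy sketch of "amplify to generate a better dataset, then retrain on it".

def amplify(model, n):
    """Answer a question the current model can't handle directly by
    decomposing it into sub-questions the model can answer."""
    if n in model:
        return model[n]
    return amplify(model, n - 1) + amplify(model, n - 2)

def distill(model, questions):
    """'Retrain' on the dataset that amplification generated
    (memorization here stands in for learning)."""
    dataset = {n: amplify(model, n) for n in questions}
    return {**model, **dataset}

model = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3, 5: 5}   # weak initial model
for round_ in range(3):
    model = distill(model, range(max(model) + 1, max(model) + 6))
    print(f"after round {round_}: answers up to n = {max(model)} directly")
```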