I’m not sure alignment research would actually scale with capabilities. It seems more likely to me that increasing capabilities makes alignment harder in some ways (e.g. interpretability), though of course it could become easier in others (e.g. using AI as a tool for alignment). So while alignment techniques will hopefully advance further in the future, it seems doubtful, or at least unclear, whether they will keep up with capabilities development.
Don’t see why not. It would be far easier than automating neuroscience research. There is an implicit assumption that we won’t wait for alignment research to catch up, but people will naturally get more suspicious of models as they get more powerful. We are already seeing frameworks for a pause emerge at this stage.
Additionally, to align ASI correctly, you only need to align weakly superhuman, human-level, or below-human-level agents, who will then align superintelligent agents stronger than themselves, or convince you to stop progress.
Recursive self-alignment is parallel to recursive self-improvement.
Assuming it can scale with capabilities, that doesn’t help you if alignment is scaling at y = 2x and capabilities at y = 123x (totally made-up numbers, but you get the point). A quick Google search found an article from 2017 claiming that there are some 300k AI researchers worldwide. I see claims around here that there are something like 300 alignment researchers. Those numbers should be taken with a large grain of salt, but even so, that’s 1000:1.
As to recursive improvement, nope: check out the tiling problem. Also, “only” is doing a massive amount of work in “only need to align”, seeing as no one (as far as I can tell) has a good idea of how to do that (though there are interesting approaches to some subproblems).
There are only 300 alignment researchers because current AI is pretty stupid. As AI gets smarter, alignment concerns will clearly get more credence. We already have frameworks for a pause, and we will likely see early alignment failures along the way.
As to recursive improvement, nope: check out the tiling problem. Also, “only” is doing a massive amount of work in “only need to align”, seeing as no one (as far as I can tell) has a good idea of how to do that (though there are interesting approaches to some subproblems).
What? The whole foom argument rests on recursive self improvement. Are you saying that’s wrong?
A lot of foom arguments rest on us, as puny humans, having to align nascent gods that can reverse entropy. QACI, for example, seems to be aiming to align such an entity. I think they miss the mark on scalability in ways that make the problem both harder and easier. Harder, in that you need a method that can be built on; easier, in that you only need to align a single iteration of superintelligence.
There is value drift, but we can estimate how many iterations of recursive self-alignment/improvement (RSA/I) there will be and how much value drift is acceptable.
An example could be that we develop strong interpretability methods to align human-level AI, which then takes over the job from us and aligns the next generation of AI. We can estimate the value drift at each step and decide whether or not it’s acceptable. If the human-level AI is properly aligned and concludes the next step is impossible, it will tell us to stop building a new generation, and we will probably listen.
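To make that handoff loop concrete, here is a toy sketch in Python. It is illustrative only: `align_successor`, the drift numbers, and the drift budget are hypothetical stand-ins for capabilities we do not currently have.

```python
# Toy sketch of the iterated alignment handoff described above (hypothetical).
# align_successor() stands in for "use interpretability etc. to align the next,
# slightly more capable generation, and estimate the value drift incurred".

DRIFT_BUDGET = 0.05  # assumed total acceptable value drift (5%), purely illustrative

def run_handoff_chain(first_aligned_ai, max_generations=32):
    """Each generation tries to align its successor; halt if the drift budget is
    exceeded or the current generation judges the next step impossible."""
    current, cumulative_drift = first_aligned_ai, 0.0
    for generation in range(max_generations):
        result = current.align_successor()  # hypothetical: returns (successor, drift) or None
        if result is None:
            print(f"Generation {generation}: next step judged impossible, halting.")
            break
        successor, step_drift = result
        if cumulative_drift + step_drift > DRIFT_BUDGET:
            print(f"Generation {generation}: drift budget exceeded, halting.")
            break
        current, cumulative_drift = successor, cumulative_drift + step_drift
    return current, cumulative_drift
```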
The foom problem is worse because of how hard it is to trust the recursion. Foomability is only weakly correlated with whether the foomed entity is aligned, at least from our perspective. That’s why there’s the whole emphasis on getting it right on the first try.
How can you estimate how many iterations of RSA will happen?
How does interpretability align an AI? It can let you know when things are wrong, but that doesn’t mean it’s aligned.
QACI can potentially solve outer alignment by giving you a rigorous and well-specified mathematical target to aim for. That still leaves the other issues (though they are being worked on).
QACI isn’t scalable, so by the time an ASI is powerful enough to implement it, you’ll already be dead.
How does interpretability align an AI? It can let you know when things are wrong, but that doesn’t mean it’s aligned.
You are reading too much into the example. If we have a method of aligning a target of slightly greater intelligence with only a small value drift, and this method can be applied recursively, then we have solved the alignment problem.
This can be made even weaker: if, for any given intelligence, a method always exists to align a slightly more capable target with acceptable value drift, and that method can be found by the lesser intelligence, then we only have to solve the alignment problem for the first iteration.
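One way to pin that down (my own formalization, under the strong assumption that “aligns with drift at most ε” composes cleanly across levels):

```latex
% Informal induction: if every level can align the next with bounded drift,
% then aligning the first level A_0 is enough.
\[
  \Big(\forall n:\ \exists\, M_n \text{ findable by } A_n \text{ that aligns } A_{n+1}
  \text{ with drift} \le \varepsilon\Big) \,\wedge\, A_0 \text{ aligned}
  \;\Longrightarrow\;
  A_N \text{ aligned up to drift} \le 1-(1-\varepsilon)^N \le N\varepsilon .
\]
```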
How can you estimate how many iterations of RSA will happen?
It’s useful to figure out the hard physical limits of intelligence. If we knew those limits, we could estimate how many iterations it takes to reach them, and from that how much value drift is acceptable per iteration.
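Concretely (illustrative arithmetic, assuming drift compounds multiplicatively): if the limit is reached in N iterations and D_max is the total drift we can tolerate, the per-iteration budget is roughly

```latex
\[
  \varepsilon_{\text{step}} \;\approx\; 1-(1-D_{\max})^{1/N} \;\approx\; \frac{D_{\max}}{N}
  \quad\text{for small } D_{\max}.
\]
```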
How do the hard limits of intelligence help? My current understanding is that the hard limits are likely to be something like Jupiter brains, rather than mentats. If each step is only slightly better, won’t that result in a massive number of tiny steps (even taking into account the nonlinearity of it)?
Small value drifts are a large problem, if compounded. That’s sort of the premise of a whole load of fiction, where characters change their value systems after sequences of small updates. And that’s just in humans—adding in alien (as in different) minds could complicate this further (or not—that’s the thing about alien minds).
How do the hard limits of intelligence help? My current understanding is that the hard limits are likely to be something like Jupiter brains, rather than mentats. If each step is only slightly better, won’t that result in a massive number of tiny steps (even taking into account the nonlinearity of it)?
I think the hard limits are a lot lower than most people think. Light takes roughly an eighth of a second to travel around the Earth, so it doesn’t sound too useful to have a planet-sized module if information transfer is so slow that the individual parts will always be out of sync. I haven’t looked at this issue extensively, but afaik no one else has either. I think at some point AI will just become a compute cluster: a legion of minds in a hierarchy of some sort.
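A quick back-of-the-envelope check on that latency claim (approximate figures, vacuum light speed; real interconnects are slower):

```python
# Light-lag across and around Earth (rough numbers, vacuum speed of light).
C_LIGHT_KM_S = 299_792            # km/s
EARTH_DIAMETER_KM = 12_742        # mean diameter
EARTH_CIRCUMFERENCE_KM = 40_075   # equatorial circumference

print(f"Across the diameter: {EARTH_DIAMETER_KM / C_LIGHT_KM_S * 1e3:.0f} ms")       # ~43 ms
print(f"Around the surface:  {EARTH_CIRCUMFERENCE_KM / C_LIGHT_KM_S * 1e3:.0f} ms")  # ~134 ms, about 1/8 s
```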
(GPT-4 is already said to be 8 subcomponents masquerading as 1)
Obviously this is easier to align as we only have to align individual modules.
Furthermore, it matters how big each increment is. As a hypothetical: if each step is a 50% capability jump with a 0.1% value drift, and we hit the limit of a module in 32 jumps, then we get roughly a 430,000x capabilities increase with only about a 3% overall drift. Obviously you can fiddle with these numbers, but my point is that it’s plausible. The fact that verifying a solution is usually far easier than coming up with one can help us keep drift rates this low throughout. It’s not entirely clear how much drift is actually acceptable, just that it’s a potential issue.
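A quick check of those hypothetical numbers (same assumptions as above: a 50% jump and 0.1% drift per step, 32 steps):

```python
# Compound the hypothetical per-step figures from the comment above.
STEPS = 32
CAPABILITY_JUMP = 1.5   # +50% capability per step
STEP_DRIFT = 0.001      # 0.1% value drift per step

total_capability = CAPABILITY_JUMP ** STEPS      # ~431,000x
total_drift = 1 - (1 - STEP_DRIFT) ** STEPS      # ~3.2%

print(f"Capability increase: {total_capability:,.0f}x")   # 431,440x
print(f"Cumulative value drift: {total_drift:.1%}")       # 3.2%
```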
Small value drifts are a large problem, if compounded. That’s sort of the premise of a whole load of fiction, where characters change their value systems after sequences of small updates. And that’s just in humans—adding in alien (as in different) minds could complicate this further (or not—that’s the thing about alien minds).
For value drift to be interesting in fiction, it has to go catastrophically wrong. I just don’t see why it’s guaranteed to fail, and I think the most realistic alignment plans are some version of this.