QACI isn’t scalable, so by the time an ASI is powerful enough to implement it, you’ll already be dead.
How does interpretability align an AI? It can let you know when things are going wrong, but that doesn’t mean the AI is aligned.
You are reading too much into the example. If we have a method of aligning a target of slightly greater intelligence with only a small value drift, and this method can be applied recursively, then we solve the alignment problem.
This can be even weaker: if, for any given intelligence, a method always exists to align a slightly more capable target with acceptable value drift, and that method can be found by the lesser intelligence, then we only have to solve the alignment problem for the first iteration.
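To make the inductive structure concrete, here is a minimal sketch (illustrative only; `Agent`, `align_successor`, and the parameters are hypothetical placeholders, not a real alignment method). The base case is the one system humans align directly; every later step is performed by the previous generation.

```python
# Illustrative sketch of the inductive claim only; nothing here is a real
# alignment algorithm. `align_successor` stands in for whatever per-step
# method each generation manages to find.
from dataclasses import dataclass

@dataclass
class Agent:
    capability: float  # relative capability; generation 0 = 1.0
    fidelity: float    # fraction of original values retained; 1.0 = no drift

def align_successor(agent: Agent, step: float, per_step_drift: float) -> Agent:
    """One iteration: a slightly more capable successor with slightly more drift."""
    return Agent(agent.capability * step, agent.fidelity * (1 - per_step_drift))

def bootstrap(generations: int, step: float, per_step_drift: float) -> Agent:
    agent = Agent(capability=1.0, fidelity=1.0)  # base case: the system we align ourselves
    for _ in range(generations):
        agent = align_successor(agent, step, per_step_drift)
    return agent
```

The loop itself is trivial; the whole scheme lives or dies on the base case and on each generation actually finding an adequate `align_successor`.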
How can you estimate how many iterations of RSA will happen?
It’s useful to figure out the hard physical limits of intelligence. If we knew them, we could approximate how much value drift is acceptable per iteration.
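For instance (placeholder numbers, just to show the shape of the calculation): if a total cumulative drift of D is tolerable and reaching the limit takes N iterations, the per-iteration budget works out to roughly 1 - (1 - D)^(1/N).

```python
# Hypothetical numbers, purely to illustrate how a per-iteration drift budget
# would fall out of a total tolerance and an estimated iteration count.
total_tolerable_drift = 0.05   # placeholder: accept 5% cumulative drift overall
iterations_to_limit = 32       # placeholder: steps until the hard physical limit

per_step_budget = 1 - (1 - total_tolerable_drift) ** (1 / iterations_to_limit)
print(f"per-iteration drift budget ~ {per_step_budget:.3%}")  # ~0.160%
```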
How do the hard limits of intelligence help? My current understanding is that the hard limits are likely to be something like Jupiter brains rather than mentats. If each step is only slightly better, won’t that result in a massive number of tiny steps (even taking the nonlinearity into account)?
Small value drifts are a large problem if compounded. That’s sort of the premise of a whole load of fiction, where characters change their value systems through sequences of small updates. And that’s just in humans; adding alien (as in different) minds could complicate this further (or not; that’s the thing about alien minds).
I think the hard limits are a lot lower than most people think. Light takes on the order of 40 milliseconds just to cross the Earth’s diameter (over a tenth of a second to go around it), so it doesn’t sound too useful to have a planet-sized module if information transfer is so slow that the individual parts will always be out of sync. I haven’t looked at this issue extensively, but afaik no one else has either. I think at some point AI will just become a compute cluster: a legion of minds in a hierarchy of some sort.
(GPT-4 is already said to be 8 subcomponents masquerading as 1)
Obviously a cluster like that is easier to align, since we only have to align the individual modules.
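To put a rough number on the latency point above (back-of-the-envelope, with a ~3 GHz clock assumed purely for scale):

```python
# Back-of-the-envelope: light-speed delay across Earth vs. a chip's clock rate.
c = 299_792_458                # speed of light in vacuum, m/s
earth_diameter_m = 12_742_000  # mean Earth diameter, m
clock_hz = 3e9                 # an assumed ~3 GHz clock, purely for scale

one_way_delay_s = earth_diameter_m / c       # ~0.0425 s across the diameter
cycles_of_lag = one_way_delay_s * clock_hz   # ~1.3e8 clock cycles per crossing
print(f"{one_way_delay_s * 1e3:.1f} ms one way, ~{cycles_of_lag:.1e} cycles of lag")
```

Hundreds of millions of cycles of lag per crossing is the sense in which a planet-sized module would always be out of sync with itself.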
Furthermore, it matters how big each increment is. As a hypothetical, if each step is a 50% capability jump with a 0.1% value drift, and we hit the limit of a module in 32 jumps, then we end up with roughly a 430,000x capability increase and about 3% overall drift. Obviously we can fiddle with these numbers, but my point is that it’s plausible. The fact that verification is usually far easier than coming up with a solution can help us maintain these low drift rates throughout. It’s not entirely clear how much drift is acceptable, just that it’s a potential issue.
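Checking the arithmetic in that hypothetical (the 50% / 0.1% / 32 figures are the ones assumed above, not estimates):

```python
# Compounding the hypothetical numbers from the comment above.
step, drift_per_step, n_jumps = 1.5, 0.001, 32

capability_gain = step ** n_jumps                        # ~431,000x
cumulative_drift = 1 - (1 - drift_per_step) ** n_jumps   # ~3.2%
print(f"{capability_gain:,.0f}x capability, {cumulative_drift:.1%} cumulative drift")
```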
For value drift to be interesting in fiction, it has to go catastrophically wrong. I just don’t see why it’s guaranteed to fail here, and I think the most realistic alignment plans are some version of this scheme.