or we need to figure out some way to access the inaccessible information that “A* leads to lots of human flourishing.”
To help check my understanding, your previously described proposal to access this “inaccessible” information involves building corrigible AI via iterated amplification, then using that AI to capture “flexible influence over the future”, right? Have you become more pessimistic about this proposal, or are you just explaining some existing doubts? Can you explain in more detail why you think it may fail?
(I’ll try to guess.) Is it that corrigibility is about short-term preferences-on-reflection and short-term preferences-on-reflection may themselves be inaccessible information?
I can pay inaccessible costs for an accessible gain — for example leaking critical information, or alienating an important ally, or going into debt, or making short-sighted tradeoffs. Moreover, if there are other actors in the world, they can try to get me to make bad tradeoffs by hiding real costs.
This seems similar to what I wrote in an earlier thread: “What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)” At the time I thought you proposed to solve this problem by using the user’s “preferences-on-reflection”, which presumably would correctly value all resources/costs. So again is it just that “preferences-on-reflection” may itself be inaccessible?
Overall I don’t think it’s very plausible that amplification or debate can be a scalable AI alignment solution on their own, mostly for the kinds of reasons discussed in this post — we will eventually run into some inaccessible knowledge that is never produced by amplification, and so never winds up in your distilled agents.
Besides the above, can you give some more examples of (what you think may be) “inaccessible knowledge that is never produced by amplification”?
(I guess an overall feedback is that in most of the post you discuss inaccessible information without talking about amplification, and then quickly talk about amplification in the last section, but it’s not easy to see how the two ideas relate without more explanations and examples.)
To help check my understanding, your previously described proposal to access this “inaccessible” information involves building corrigible AI via iterated amplification, then using that AI to capture “flexible influence over the future”, right? Have you become more pessimistic about this proposal, or are you just explaining some existing doubts? Can you explain in more detail why you think it may fail?
(I’ll try to guess.) Is it that corrigibility is about short-term preferences-on-reflection and short-term preferences-on-reflection may themselves be inaccessible information?
I think that’s right. The difficulty is that short-term preferences-on-reflection depend on “how good is this situation actually?” and that judgment is inaccessible.
This post doesn’t reflect me becoming more pessimistic about iterated amplification or alignment overall. This post is part of the effort to pin down the hard cases for iterated amplification, which I suspect will also be hard cases for other alignment strategies (for the kinds of reasons discussed in this post).
This seems similar to what I wrote in an earlier thread: “What if the user fails to realize that a certain kind of resource is valuable?
Yeah, I think that’s similar. I’m including this as part of the alignment problem—if unaligned AIs realize that a certain kind of resource is valuable but aligned AIs don’t realize that, or can’t integrate it with knowledge about what the users want (well enough to do strategy stealing) then we’ve failed to build competitive aligned AI.
(By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)”
Yes.
At the time I thought you proposed to solve this problem by using the user’s “preferences-on-reflection”, which presumably would correctly value all resources/costs. So again is it just that “preferences-on-reflection” may itself be inaccessible?
Yes.
Besides the above, can you give some more examples of (what you think may be) “inaccessible knowledge that is never produced by amplification”?
If we are using iterated amplification to try to train a system that answers the question “What action will put me in the best position to flourish over the long term?” then in some sense the only inaccessible information that matters is “To what extent will this action put me in a good position to flourish?” That information is potentially inaccessible because it depends on the kind of inaccessible information described in this post: what technologies are valuable? what’s the political situation? am I being manipulated? is my physical environment being manipulated? And so forth. That information in turn is potentially inaccessible because it may depend on internal features of models that are only validated by trial and error, for which we can’t elicit the correct answer either by directly checking it or by transfer from other accessible features of the model.
(I might be misunderstanding your question.)
(I guess an overall feedback is that in most of the post you discuss inaccessible information without talking about amplification, and then quickly talk about amplification in the last section, but it’s not easy to see how the two ideas relate without more explanations and examples.)
By default I don’t expect to give enough explanations or examples :) My next step in this direction will be thinking through possible approaches for eliciting inaccessible information, which I may write about but which I don’t expect to be significantly more useful than this. I’m not that motivated to invest a ton of time in writing about these issues clearly because I think it’s fairly likely that my understanding will change substantially with more thinking, and I think this isn’t a natural kind of “checkpoint” to try to explain clearly. Like most posts on my blog, you should probably regard this primarily as a record of Paul’s thinking. (Though it would be great if it could be useful as explanation as a side effect, and I’m willing to put in some time to try to make it useful as explanation, just not the amount of time that I expect would be required.)
(My next steps on exposition will be trying to better explain more fundamental aspects of my view.)