To help check my understanding, your previously described proposal to access this “inaccessible” information involves building corrigible AI via iterated amplification, then using that AI to capture “flexible influence over the future”, right? Have you become more pessimistic about this proposal, or are you just explaining some existing doubts? Can you explain in more detail why you think it may fail?
(I’ll try to guess.) Is it that corrigibility is about short-term preferences-on-reflection and short-term preferences-on-reflection may themselves be inaccessible information?
I think that’s right. The difficulty is that short-term preferences-on-reflection depend on “how good is this situation actually?” and that judgment is inaccessible.
This post doesn’t reflect me becoming more pessimistic about iterated amplification or alignment overall. This post is part of the effort to pin down the hard cases for iterated amplification, which I suspect will also be hard cases for other alignment strategies (for the kinds of reasons discussed in this post).
This seems similar to what I wrote in an earlier thread: “What if the user fails to realize that a certain kind of resource is valuable?
Yeah, I think that’s similar. I’m including this as part of the alignment problem—if unaligned AIs realize that a certain kind of resource is valuable but aligned AIs don’t realize that, or can’t integrate it with knowledge about what the users want (well enough to do strategy stealing), then we’ve failed to build competitive aligned AI.
(By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)”
Yes.
At the time I thought you proposed to solve this problem by using the user’s “preferences-on-reflection”, which presumably would correctly value all resources/costs. So again is it just that “preferences-on-reflection” may itself be inaccessible?
Yes.
Besides the above, can you give some more examples of (what you think may be) “inaccessible knowledge that is never produced by amplification”?
If we are using iterated amplification to try to train a system that answers the question “What action will put me in the best position to flourish over the long term?” then in some sense the only inaccessible information that matters is “To what extent will this action put me in a good position to flourish?” That information is potentially inaccessible because it depends on the kind of inaccessible information described in this post (what technologies are valuable? what’s the political situation? am I being manipulated? is my physical environment being manipulated? and so forth). That information in turn is potentially inaccessible because it may depend on internal features of models that are only validated by trial and error, for which we can’t elicit the correct answer either by directly checking it or by transfer from other accessible features of the model.
(I might be misunderstanding your question.)
(I guess an overall feedback is that in most of the post you discuss inaccessible information without talking about amplification, and then quickly talk about amplification in the last section, but it’s not easy to see how the two ideas relate without more explanations and examples.)
By default I don’t expect to give enough explanations or examples :) My next step in this direction will be thinking through possible approaches for eliciting inaccessible information, which I may write about but which I don’t expect to be significantly more useful than this. I’m not that motivated to invest a ton of time in writing about these issues clearly because I think it’s fairly likely that my understanding will change substantially with more thinking, and I don’t think this is a natural “checkpoint” to try to explain clearly. Like most posts on my blog, you should probably regard this primarily as a record of Paul’s thinking. (Though it would be great if it could be useful as explanation as a side effect, and I’m willing to put in some time to try to make it useful as explanation, just not the amount of time that I expect would be required.)
(My next steps on exposition will be trying to better explain more fundamental aspects of my view.)