You mention that you’re unsure if the questions in this post were importantly missing from the conversation. If you know of prior discussion on these points, please let me know.
My impression is that the ease of agentizing LLMs was well recognized, but the potential large upsides and implications for alignment projects were not previously recognized in this community. If there were even a modest perceived chance of things going in that direction, we should have been hearing about it.
I’m going to do a more careful post and submit it to Alignment Forum. That should turn up prior work on this.
Did you know about "By Default, GPTs Think In Plain Sight"? It doesn't explicitly discuss agentized GPTs, but it does discuss the impact this property has on GPTs as a route to AGI, how it affects the risks, and what we should do about it (e.g., maybe RLHF is dangerous).
Thank you. I think it is relevant; I just found it yesterday while following up on this. Gwern's comment there gives a really interesting example of how we could accidentally introduce pressure for these systems to use steganography, so that their thoughts are no longer in plain English.
What I'm excited about is that agentizing them, while dangerous, could mean they not only think in plain sight but are also what actually gets used. That would cross over from only being able to describe how to achieve alignment to making it something the world would actually do.