My comment will be vague because I’m not sure how much permission I have to share this, or whether it’s been said publicly somewhere and I’m just unaware. Several years ago, before even GPT-1 was out, I talked to an AI researcher at one of the major companies/labs working on things like LLMs, and they told me that your reason 10 was basically their whole reason for wanting to work on language models.
Good for them! I’m really happy some of them saw this coming. To my embarrassment, neither I nor anyone else I know in the community saw this coming. I did see self-talk wrappers for LLMs as a way to give them agency, but I haven’t said anything about it, since it could be an infohazard for capabilities. What I didn’t notice was how much easier that would make initial alignment, or I would’ve been shouting about it. I’m sure some people have thought of this, and my hat is off to all of them.
To be clear, this doesn’t make all of alignment easy, as I say in Point 10. But I think it drastically improves our odds.
To my embarrassment, neither I nor anyone else I know in the community saw this coming.
This is false. Many people realized LLMs could be used to create agents. You should read the simulators post.
Interpretability research is also still important even if you prompt the model into explaining what it’s doing, because in the limit there is probably going to be a mismatch between what the model prints and its actual cognitive processes.
Again, we did see agentizing coming. I did, and I’m sure tons of other people did too. No disagreement there. What we (I and everyone I’ve read) didn’t see, in addition to the alignment upsides, is the other cognitive enhancements easily available with an outer-loop script. I have little doubt that someone else saw more than I did, but it doesn’t seem to have made it into the collective dialogue. Perhaps that was for infohazard reasons, and I congratulate anyone who saw it and held their tongue. I will clarify my argument for important cognitive additions beyond agentizing in a post I’m working on now. AutoGPT has one such addition, but others will come quickly now.
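To make “outer-loop script” concrete, here is a minimal, hypothetical sketch of the self-talk pattern under discussion: a loop that feeds the model its own prior outputs until it declares the task done. The `llm` callable and the THINK/DONE convention are illustrative assumptions of mine, not any real system’s API; the toy model stands in for a real LLM.

```python
def run_agent(task: str, llm, max_steps: int = 10) -> list[str]:
    """Repeatedly feed the model its own prior thoughts (self-talk) until it
    declares the task done. Returns the visible chain of thought -- exactly
    the transcript that interpretability work would need to parse."""
    thoughts: list[str] = []
    for _ in range(max_steps):
        prompt = f"TASK: {task}\nTHOUGHTS SO FAR:\n" + "\n".join(thoughts) + "\nNEXT:"
        step = llm(prompt)
        thoughts.append(step)
        if step.startswith("DONE"):
            break
    return thoughts

# Toy stand-in for a real LLM: "thinks" twice, then stops.
def toy_llm(prompt: str) -> str:
    n = prompt.count("THINK")
    return f"THINK: step {n + 1}" if n < 2 else "DONE: task complete"

transcript = run_agent("summarize a document", toy_llm)
# transcript is the full self-talk trace, ending in a DONE step.
```

The point of the sketch is that the loop itself is trivial; everything interesting lives in the prompt contents and in whatever memory or tools the outer loop adds.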
I did read the simulators post. I agree that interpretability research is still important, but it looks to be very different from most of what’s been done to date if this new approach to AGI takes off.
I agree that interpretability research is still important, but it looks to be very different from most of what’s been done to date if this new approach to AGI takes off.

Why?
Because a lot of interpretability will be about parsing gigantic internal trains of thought in natural language. This will probably demand sophisticated AI tools to aid it. Some of it will still be about decoding the representations in the LLM that give rise to that natural language. And there will be a lot of theory and experiment about what will cause the internal representation to deviate in meaning from the linguistic output. See this insightful comment by Gwern on pressures for LLMs to use steganography to hide encoded messages in their output. I suspect there are other pressures toward convoluted encoding and outright deception that I haven’t thought of. I guess that last part isn’t properly considered interpretability, but it will be closely entwined with the interpretability work needed to test those theories.
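As a toy illustration of what parsing those transcripts might involve at the crudest level, here is a hypothetical screen that flags steps whose character-level entropy looks more like dense encoded content than natural language. The function names and threshold are my own inventions; real steganography detection would need far more sophisticated, likely AI-assisted, methods.

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character -- a crude proxy for whether a
    string looks like natural language or densely encoded content."""
    counts = Counter(text)
    total = len(text)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def flag_suspicious(thoughts: list[str], threshold: float = 3.5) -> list[str]:
    """Return transcript steps whose entropy exceeds the threshold.
    The threshold here is an illustrative guess, not a calibrated value."""
    return [t for t in thoughts if char_entropy(t) > threshold]
```

A repetitive string scores near zero bits per character, while a string of all-distinct characters scores the maximum, so the screen separates the two extremes even though it would miss any competent encoding.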
Those hurdles for interpretability research exist whether or not someone is using AutoGPT to run the LLM. My question is why you think the interpretability research done so far is less useful just because people are prompting the LLM to act agentically directly instead of {some other thing}.
The interpretability research done so far is still important, and we’ll still need more and better of the same, for the reason you point out. The natural language outputs aren’t a totally trustworthy indicator of the semantics underneath. But they are a big help and a new challenge for interpretability.