I haven’t listened to the whole interview, but it sounds like you might be reading more into it than is there.
Shane talked about the importance of checking the reasoning process given that reinforcement learning can lead to phenomena like deceptive alignment, but he didn’t explain exactly how he hopes to deal with this other than saying that the reasoning process has to be checked very carefully.
This could potentially tie into proposals such as approval-based agents, interpretability, or externalized reasoning, but it wasn’t clear to me how exactly he wanted to do this. Right now, you can ask an agent to propose a plan and provide a justification, but I’m sure he knows just how unreliable this is.
It’s not clear that he has a plan beyond “we’ll find a way” (this isn’t a quote).
But, as I said, I didn’t listen to the whole podcast, so feel free to let me know if I missed anything.
I think you’re right that I’m reading into this. But there is probably more to his thinking, whether I’m right or wrong about what that is. Shane Legg was thinking about alignment as far back as his PhD thesis, which doesn’t go into depth on it but does show he’d at least read some of the literature prior to 2008.
I agree that LLM chain of thought is not totally reliable, but I don’t think it makes sense to dismiss it as too unreliable to build an alignment solution on. There’s so much that hasn’t been tried, both in making LLMs themselves more reliable and in making agents built on top of them more reliable: taking multiple reasoning paths, and using fresh context windows and different models to force them to break problems into steps, with the last natural-language statement serving as the whole context for the next step.
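To make those last two ideas a bit more concrete, here’s a rough sketch of the kind of loop I have in mind. `call_model` is just a placeholder for whatever LLM backend you’d use, and the prompts and step limits are made up for illustration; the point is the control flow, not any particular API.

```python
from collections import Counter


def call_model(prompt: str) -> str:
    # Placeholder for an actual LLM call; plug in whatever backend you use.
    raise NotImplementedError("plug in a real LLM call here")


def solve_stepwise(task: str, max_steps: int = 10) -> str:
    """Force the model to work in discrete steps, where each step's context
    is only the task plus the previous step's natural-language output."""
    state = "No work done yet."
    for _ in range(max_steps):
        # Fresh context every iteration: the model never sees its full
        # history, only the task and the last stated result.
        prompt = (
            f"Task: {task}\n"
            f"Previous result: {state}\n"
            "Do the single next step and state the new result in plain "
            "language. If the task is finished, begin with DONE:"
        )
        state = call_model(prompt)
        if state.startswith("DONE:"):
            return state[len("DONE:"):].strip()
    return state


def solve_by_majority(task: str, n_paths: int = 5) -> str:
    """Run several independent attempts and keep the most common answer,
    one crude way of trading compute for reliability."""
    answers = [solve_stepwise(task) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```

Each step reads only the last natural-language statement, and independent runs get aggregated, so any single bad chain of thought matters less. Whether this scales to anything alignment-relevant is exactly the open question.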
Whether or not this is a reliable path to alignment, it’s a potential path to huge profits. So there are two questions: will this lead to alignable AGI? And will it lead to AGI at all? I think both are unanswered.