I’m a fairly new alignment researcher, although I’ve been following LessWrong, MIRI’s writing, etc. for many years.
My approach so far is to understand and use interpretability tools (e.g. learning from Olah’s circuits team, Buck at Redwood Research, and others who aren’t themselves trying to tackle the ‘hard problem’ but are producing useful techniques like ROME, https://rome.baulab.info/) to find and control natural abstractions (à la John Wentworth) in a variety of model architectures, starting with transformers.
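To make the “editing” part concrete: the core move in ROME-style model editing is a rank-one update to a weight matrix so that a chosen key vector maps to a chosen value vector. The sketch below is a simplified, unweighted version (ROME itself uses a covariance-weighted update on a specific MLP layer); the function name and toy dimensions are my own illustration, not the paper’s code.

```python
import numpy as np

def rank_one_edit(W, k_star, v_star):
    """Return an edited matrix W' with W' @ k_star == v_star, using the
    minimum-Frobenius-norm rank-one update. (ROME proper uses a
    covariance-weighted variant; this is the unweighted sketch.)"""
    residual = v_star - W @ k_star                      # gap between current and target output
    update = np.outer(residual, k_star) / (k_star @ k_star)
    return W + update

# Toy demo: steer a random "layer" so that key k maps exactly to value v.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
k = rng.normal(size=3)
v = rng.normal(size=4)
W_edited = rank_one_edit(W, k, v)
assert np.allclose(W_edited @ k, v)                    # edit hits the target
assert np.linalg.matrix_rank(W_edited - W) == 1        # and is rank one
```

The appeal of this kind of edit is that it is surgical: only a rank-one direction of the weight matrix changes, which is why it is a natural tool for probing and controlling specific abstractions rather than retraining the whole model.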
An important additional aspect: I think it is plausible to edit a variety of architectures in some generalizable way to partially separate them into ‘modules’, split along natural dividing points (derived from natural abstractions). The result would be something like a hybrid between a single black box and a mixture of experts. I’d expect this refactoring to impose some ‘alignment tax’ on capabilities, but not as extreme as going to fully siloed narrow experts.
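One very simple way to look for candidate dividing points, as a starting heuristic, is to group hidden units whose activations co-vary across inputs: units that always fire together are plausible members of one ‘module’. The sketch below is a toy version of that idea (the function, threshold, and planted-signal demo are all my own illustration, not an established method).

```python
import numpy as np

def candidate_modules(activations, threshold=0.5):
    """Greedily group hidden units into candidate 'modules' by activation
    correlation: a unit joins an existing module if its |correlation| with
    that module's first member exceeds the threshold, else it starts a new one.
    activations: (n_samples, n_units) array of recorded unit activations."""
    corr = np.abs(np.corrcoef(activations.T))   # unit-by-unit |correlation|
    seeds, labels = [], []
    for i in range(corr.shape[0]):
        for m, s in enumerate(seeds):
            if corr[i, s] >= threshold:
                labels.append(m)                # join module m
                break
        else:
            seeds.append(i)                     # found nothing similar: new module
            labels.append(len(seeds) - 1)
    return labels

# Toy demo: four units driven by two independent signals, in two planted groups.
rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=(2, 500))
acts = np.stack([s1, s1 + 0.1 * rng.normal(size=500),
                 s2, s2 + 0.1 * rng.normal(size=500)], axis=1)
labels = candidate_modules(acts)
assert labels == [0, 0, 1, 1]                   # recovers the two planted groups
```

Real networks would of course need something far less naive (correlation is only a first-order signal, and natural abstractions need not align with individual units), but even a crude partition like this gives you concrete module boundaries to test an alignment-tax hypothesis against.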
One of the goals along the way should be to fundamentally understand general intelligence, so that we can better detect and control it.
My main self-criticism: too slow. I think my approach has a pretty good chance of working given a team of 10 people at least as smart and productive as me, and about 50 years to work on the problem. I super don’t think we have anywhere near 50 years; more like 10. Also, I’m not part of a team of 10 people working on the same idea. I’m new, and I’m working alone.
Response: My hope is to join a team working on a similar set of goals, and thereby contribute usefully toward a solution. I also hope to convince people that this problem is big and urgent enough that we should scale up the number of smart, capable people working on it even more rapidly.