I’ve actually worked on a toy model related to producing alignment researchers, although the visionary-focused angle is a new one, and I’d like to update the model so it puts more emphasis on sourcing high-impact individuals and less on producing larger numbers of better rank-and-file workers. I’m definitely interested in improving it, especially if it’s wildly inaccurate.
My understanding is that finding people with high output per year requires knowing the ratio of Convenient Champions to Skilled Students to Unassuming Underdogs to Average Americans. Once you know their relative frequencies in the pool, the hard part is the evaluation period: how good can you get at telling Skilled Students apart from Convenient Champions, both of whom learn math quickly and distinguish themselves on their own, and how good can you get at telling Unassuming Underdogs apart from Average Americans, both of whom pick up new kinds of quantitative thinking more slowly, except that the Unassuming Underdogs eventually become very skilled at applying it and turn into high-impact champions after a long waiting period.
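As a minimal sketch of that selection problem (the archetype frequencies, test scores, and impact numbers below are made-up assumptions purely for illustration, not estimates):

```python
import random

# Hypothetical archetype parameters (made-up numbers, for illustration only).
# Each entry: (relative frequency in the pool, early test-score mean, eventual impact)
ARCHETYPES = {
    "Convenient Champion": (0.01, 0.9, 10.0),  # learns fast, ends up high impact
    "Skilled Student":     (0.09, 0.9, 1.0),   # learns fast, ends up ordinary impact
    "Unassuming Underdog": (0.05, 0.3, 10.0),  # learns slowly, high impact after a long wait
    "Average American":    (0.85, 0.3, 1.0),   # learns slowly, ordinary impact
}

def impact_captured(pool_size=500, noise=0.2, cutoff=0.6, seed=0):
    """Draw a pool, give everyone a noisy early evaluation, select above the
    cutoff, and report what fraction of the pool's eventual impact you kept."""
    rng = random.Random(seed)
    names = list(ARCHETYPES)
    weights = [ARCHETYPES[n][0] for n in names]
    kept, total = 0.0, 0.0
    for _ in range(pool_size):
        name = rng.choices(names, weights=weights)[0]
        _, score_mean, impact = ARCHETYPES[name]
        early_score = score_mean + rng.gauss(0, noise)  # noisy early evaluation
        total += impact
        if early_score >= cutoff:
            kept += impact
    return kept / total

if __name__ == "__main__":
    for noise in (0.1, 0.2, 0.4):
        print(f"evaluation noise {noise}: fraction of impact captured = "
              f"{impact_captured(noise=noise):.2f}")
```

The point of the toy: with a noisy early test and a hard cutoff, the Unassuming Underdogs score like Average Americans, so most of their eventual impact never gets selected.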
My understanding of the current strategy described here is that we wait for people to get good at alignment on their own, even though there are only ~300 people on earth working on tasks that even vaguely sound like AI safety. This implies we have a pool of <500 people currently skilling up to work on AI safety (probably significantly fewer), and we… just… wait for geniuses to figure things out on their own, distinguish themselves on their own, do everything right, and pop out of the pool on their own, with zero early detection or predictive analytics, and we got ~5 people that way? Even though there are something like 200,000 people on earth at +4 SD? And we’re just crossing our fingers hoping for summoned heroes to show up at Berkeley and save the day?
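(For reference, the ~200,000 figure is roughly what a normal distribution gives; a back-of-the-envelope check, assuming a world population of about 8 billion:)

```python
from statistics import NormalDist  # Python 3.8+

world_population = 8e9                       # rough assumption
frac_above_4sd = 1 - NormalDist().cdf(4.0)   # about 3.2e-5
print(f"{world_population * frac_above_4sd:,.0f} people above +4 SD")
# prints roughly 253,000
```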
Also look at the inverse direction. Right now, we have finally found a trick that works well enough to build a useful AI. And it has serious issues: hallucination, being threatening or creepy, making statements that disagree with a web search, writing code that doesn’t compile or run. Now that we know about those issues, there are straightforward things to try to fix them.*
...Would an alignment genius of any stripe, without knowing the results, have been able to predict the issues current models have and design algorithms for the countermeasures?
This is a testable query. Did anyone in the alignment field (1) predict the current problems with hallucination** and (2) propose a solution?
*Recursion. It appears that current models can examine text that happens to be their own output and can be prompted to inspect it for errors of logic and fact. This suggests a way to fix this particular issue with current-scale LLMs.
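A minimal sketch of the kind of self-inspection loop I mean, where generate() is a stand-in for whatever model call you have access to (the function and prompts are hypothetical, not any specific library’s API):

```python
def generate(prompt: str) -> str:
    """Placeholder for an actual model call; hypothetical, not a real API."""
    raise NotImplementedError

def answer_with_self_review(question: str, max_rounds: int = 2) -> str:
    """Ask for an answer, then feed the model its own output and ask it to
    check for errors of logic and fact, revising if it finds any."""
    draft = generate(question)
    for _ in range(max_rounds):
        critique = generate(
            "Inspect the following answer for errors in logic or fact. "
            "List any errors, or reply 'OK' if there are none.\n\n" + draft
        )
        if critique.strip().upper().startswith("OK"):
            break  # the model found nothing to correct
        draft = generate(
            "Rewrite the answer below, fixing these identified errors.\n\n"
            f"Errors:\n{critique}\n\nAnswer:\n{draft}"
        )
    return draft
```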
**And how did they do it? The field of mathematics doesn’t cover complex networks of arbitrary learned functions hallucinating information, does it...
Another point: for fission, there are high-performance reactor designs that can never be made safe. Possibly AGI is the same: only very restricted, specific designs are mostly safe, and everything else is not. There may not exist a general alignment solution. There isn’t one for fission, and high-performance reactors (like nuclear salt-water rocket engines) are absurdly unsafe.
For fission it also took years and thousands of people working on it, armed with data from previous reactors and previous meltdowns, before they arrived at Mostly Safe designs.