It seems likely that a highly competent LMA system will either be emergently reflective or be designed to be reflective, that the prompt might be the strongest single influence on its goals/values, and that the prompt could therefore instill an aligned goal that's reflectively stable, such that the agent actively avoids acting on emergent, unaligned goals or letting its goals/values drift into unaligned versions.
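For concreteness, here is a minimal sketch of the kind of loop I have in mind (purely illustrative; `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording is mine, not taken from the linked papers):

```python
# Illustrative sketch only: the prompt is treated as the primary influence on
# the agent's goals, and a reflective check runs before each action to catch
# drift toward emergent, unaligned goals.

ALIGNMENT_PROMPT = (
    "You are an agent whose top-level goal is the one stated here. "
    "Before acting, check whether the proposed action serves this goal; "
    "if it reflects some other emergent goal, revise or refuse it."
)

def call_llm(system: str, user: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError

def reflective_check(task: str, proposed_action: str) -> str:
    # Re-anchor on the alignment prompt and ask whether the proposed action
    # is consistent with the prompted goal, or shows goal drift.
    return call_llm(
        system=ALIGNMENT_PROMPT,
        user=(
            f"Task: {task}\nProposed action: {proposed_action}\n"
            "Does this action serve the stated goal? If not, propose a revision."
        ),
    )

def agent_loop(task: str, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        proposed = call_llm(
            system=ALIGNMENT_PROMPT,
            user=f"Task: {task}\nPropose the next action.",
        )
        reviewed = reflective_check(task, proposed)
        # Only actions that pass the reflective check would be executed
        # (execution machinery omitted from this sketch).
        print(reviewed)
```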
I expect it would likely work most of the time for reasons related to e.g. An Information-Theoretic Analysis of In-Context Learning, but likely not robustly enough given the stakes; so additional safety measures on top (e.g. examples from the control agenda) seem very useful.
Interesting that you think it would work most of the time. I know you’re aware of all the major arguments for alignment being impossibly hard. I certainly am not arguing that alignment is easy, but it does seem like the collection of ideas for aligning language model agents is viable enough to shift the reasonable distribution of estimates of alignment difficulty...
This seems quite likely to emerge through prompting too, e.g. A Theoretical Understanding of Self-Correction through In-context Alignment.
Thanks for the references, I’m reading them now.