There are all sorts of “strategies” (turn it off, raise it like a kid, disincentivize changing the environment, use a weaker AI to align it) that people come up with when they’re new to the field of AI safety, but that are ineffective. And their ineffectiveness is only obvious and explainable by people who specifically know how AI behaves.
yep, but the first three all fail for the shared reason that "programs will do what their code says to do, including in response to your efforts". (the fourth one, 'use a weaker AI to align it', is at least obviously not itself a solution. the weakest form of it, using an LLM to assist an alignment researcher, is possible, and some less weak forms likely are too.)
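(to gesture at what i mean by that, here's a toy sketch; the actions and payoffs are entirely made up, and nothing here is meant as a model of a real system:)

```python
# toy sketch: a planner that just maximizes whatever objective it was given.
# the off-switch is only another feature of the environment to it; there is
# no special handling for "the humans are trying to turn me off".
# (actions and payoffs are invented purely for illustration.)

ACTIONS = {
    "comply_with_shutdown": 0.0,   # shut down -> no further task reward
    "keep_working": 1.0,           # task reward, but humans might still press the button
    "disable_off_switch": 1.5,     # task reward, and the button can no longer interrupt it
}

def plan(actions):
    # the program does exactly what its code says: pick the highest-scoring
    # action, including when that action is a response to your attempt to stop it
    return max(actions, key=actions.get)

print(plan(ACTIONS))  # -> disable_off_switch
```

the point being that 'turn it off' is itself just an event in the environment being optimized over, so the program's response to it is determined by its objective, not by your intentions.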
when i think of other 'newly heard of alignment' proposals, like boxing, most of them seem to fail because the proposer doesn't actually have a model of how the proposal is supposed to work or help in the first place. (the strong version of 'use ai to align it' probably fits better here)
(there are some issues which a programmatic model doesn't automatically make obvious to a human: they must follow from that model, but one could fail to see them without making the basic mistake above. probable environment hacking and decision-theory issues come to mind. i agree that, on general priors, this is some evidence that there are deeper subjects which would not be noticed even conditional on those researchers approving a solution.)
i guess my next response then would be that some subjects are bounded, and we might notice (even if we can't prove) such bounds, telling us 'there's nothing more beyond what you have already written down', which would be negative evidence (with strength depending on how strongly we've identified a bound). (this is more of an intuition, i don't know how to elaborate on it)
(also, on what johnswentworth wrote: a similar point i was considering making is that the question is set up in a way that forces you into playing a game of "show how you'd outperform magnus carlsen {those researchers} at chess {alignment theory}": for any consideration you can think of, one can respond that those researchers will probably also think of it, which might preclude them from actually approving, which makes the conditional 'they approve but it's wrong'[1] harder to be true and basically dependent on them instead of on object-level properties of alignment.)
i am interested in reading more arguments about the object-level question if you or anyone else has them.
If the solution to alignment were simple, we would have found it by now [...] That there is one simple thing from which comes all of our values, or a simple way to derive such a thing, just seems unlikely.
the pointer to values does not need to be complex (even if the values themselves are)
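(a minimal sketch of that distinction; `human_values` here is just a hypothetical placeholder for something we can't actually write down:)

```python
# toy sketch of pointer-vs-pointed-at-thing. `human_values` stands in for an
# arbitrarily complex evaluation we have no idea how to write out explicitly;
# the *pointer* to it is nonetheless a very short program.

def human_values(outcome):
    # hypothetical placeholder: imagine an enormously complicated function here
    raise NotImplementedError("we can't write this part down directly")

def objective(outcome):
    # the pointer: short and simple, regardless of how complex the thing it
    # defers to turns out to be
    return human_values(outcome)
```

(of course the hard part is making the pointer actually point at the right thing; the sketch is only about description length.)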
If the solution to alignment were simple, we would have found it by now
generally: simple things don't have to be easy to find. the hard part can be locating them within some huge space of possible things. (math, including its use in the laws of physics, comes to mind?) (and specifically to alignment: i also strongly expect an alignment solution to have some set of simple principles from which it can be easily derived, regardless of whether the program itself ends up long, but idk if i can legibly explain why. real complexity usually results from stochastic interactions in a process, but "aligned superintelligent agent" is a simply-defined, abstract thing?)
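(a contrived toy example of 'simple but hard to locate', nothing alignment-specific in it:)

```python
# the target object is tiny (a 6-letter lowercase string), but a blind search
# over the space it lives in has 26**6 (~3.1e8) candidates to wade through.
# simplicity of the thing found != ease of finding it.
import itertools
import string

TARGET = "simple"  # short and easy to state once you already have it

def blind_search():
    # enumerate every lowercase string of the target's length until we hit it
    for candidate in itertools.product(string.ascii_lowercase, repeat=len(TARGET)):
        if "".join(candidate) == TARGET:
            return "".join(candidate)

# len(TARGET) == 6 characters of "description", vs. a search space of
# 26**6 == 308_915_776 points.
```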
i guess you actually wrote 'they dont notice flaws', which is ambiguous between 'they approve' and 'they don't find affirmative failure cases'. and maybe the latter was your intent all along.
it’s understandable because we do have to refer to humans to call something unintuitive.