Strict alignment. The damn thing will actually follow some set of instructions to the letter, subject to its optimization constraints; hopefully you like the consequences of that. A potentially important crux is whether you agree with the claim that, for almost all specified instruction sets, you won’t like the consequences, and that no known good instruction set exists yet, due to various alignment difficulties.
The name is uninformative and possibly misleading. If the set of instructions is in a natural or a formal language, you push the alignment difficulty into the semantics and semiotics, which are not “strict”, and the alignment ends up not “strict” either.
In the planning-as-inference frame, I guess you probably mean something like an external evaluation of the inferred plans in their entirety (perhaps with some “good old-fashioned algorithm”, à la a “type checker”, rather than another AI, although it’s really questionable whether such a “type checker” could be built). But again, even the AI’s internal representations are symbols, not “ground truth”, so they are subject to the same difficulty of semantic and semiotic interpretation as natural language.
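To make the “type checker over plans” idea a bit more concrete, here is a minimal, purely hypothetical sketch (all the names and the Plan/Step/Constraint structures are my own illustration, not anyone’s proposal): a symbolic checker walks an inferred plan and rejects any step whose declared effects violate a constraint. Note that the “declared effects” are themselves symbols emitted by the AI, which is exactly where the semantic/semiotic difficulty re-enters.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical structures: a "plan" is a sequence of steps, each annotated
# with the effects the planner *claims* the step will have.
@dataclass
class Step:
    action: str
    declared_effects: set[str] = field(default_factory=set)

@dataclass
class Plan:
    steps: list[Step]

# A "constraint" here is just a predicate over a single step.
Constraint = Callable[[Step], bool]

def no_irreversible_effects(step: Step) -> bool:
    """Reject any step that declares an effect tagged as irreversible."""
    return not any(e.startswith("irreversible:") for e in step.declared_effects)

def check_plan(plan: Plan, constraints: list[Constraint]) -> list[tuple[int, str]]:
    """Walk the whole plan and report (step index, action) for every violation."""
    violations = []
    for i, step in enumerate(plan.steps):
        if not all(c(step) for c in constraints):
            violations.append((i, step.action))
    return violations

plan = Plan(steps=[
    Step("scan the strawberry", {"uses:scanner"}),
    Step("repurpose nearby matter", {"irreversible:environment-altered"}),
])
print(check_plan(plan, [no_irreversible_effects]))  # -> [(1, 'repurpose nearby matter')]
```

The checker itself can be a “good old-fashioned algorithm”, but everything it sees is the planner’s own symbolic description of its plan, so the guarantee is only as strong as the interpretation of those symbols.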
It is not clear to what extent strict alignment or strawberry alignment gives us the affordances to reach good outcomes, how universal and deadly the various sources of lethality involved would be, or how difficult it would be to locate such affordances, especially on the first try.
“Strict” alignment is an engineering technique or approach that shouldn’t be judged in isolation, but rather as part of a cognitive architecture as a whole, as I explained in this comment.
“Strawberry” alignment is an external evaluation criterion or characteristic. However, I would go further and say that “strawberry” is just a thought experiment, not meant as an actual eval that we will run. Its purpose is to show that the process of reasoning (and the result of reasoning, i.e., the plan) and the resulting behaviour are the actual objects of ethical evaluation and alignment, rather than simply “goals”. Goals become “good” or “bad” only in the context of larger plans and behaviour. This thought could be expressed in different ways, e.g., directly, as I just did (and as in this comment). The recent OpenAI paper “Let’s Verify Step by Step” highlights this idea, too.
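To illustrate the distinction between evaluating only goals/outcomes and evaluating the reasoning process itself (the framing “Let’s Verify Step by Step” works with), here is a minimal, purely illustrative sketch; the verifier functions and names are stand-ins of my own, not anything from the paper:

```python
from typing import Callable

# A chain of reasoning is an ordered list of steps; the final step is the answer.
ReasoningChain = list[str]

# Hypothetical verifier: a predicate over a single reasoning step.
StepVerifier = Callable[[str], bool]

def outcome_only_eval(chain: ReasoningChain, correct_answer: str) -> bool:
    """Outcome supervision: the reasoning is 'good' iff the final answer matches."""
    return chain[-1].strip() == correct_answer

def process_eval(chain: ReasoningChain, step_ok: StepVerifier) -> list[bool]:
    """Process supervision: every step of the reasoning is an object of evaluation."""
    return [step_ok(step) for step in chain]

# Toy example: the final answer happens to be right, but an intermediate step is flawed.
chain = ["let x = 3", "then 2 * x = 5", "answer: 6"]
print(outcome_only_eval(chain, "answer: 6"))                # True
print(process_eval(chain, lambda s: "2 * x = 5" not in s))  # [True, False, True]
```

The point of the sketch is only that the plan and the reasoning, not just the stated goal or the final outcome, are what gets evaluated.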