I’m definitely one of those non-experts who has never done actual machine learning, but AFAICT the article you linked both depends on and never explicitly mentions the fact that the ‘principle of indifference’ is about the epistemic state of the reasoner, while arguing that cases where the reasoner lacks the knowledge needed to hold a more accurate prior show that the principle itself is wrong.
The training of an LLM is not a random process, so indifference will not accurately predict its outcome. This does not imply anything about other forms of AI, or about whether people who reasoned in the absence of knowledge about the training process were making a mistake. It also does not imply sufficient control over the outcome of the training process to ensure that the LLM will, in general, want to do what we want it to want to do, let alone do what we want it to do.
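To make concrete the sense in which indifference is a claim about the reasoner’s knowledge rather than about the process itself, here’s a minimal sketch in Python (my own illustration, not from the article; the outcome labels and the “informed” weights are made up for the example):

```python
# Principle of indifference: a reasoner with no knowledge of the process
# spreads probability uniformly over the outcomes they can distinguish.
# The outcome labels and the "informed" weights below are hypothetical,
# purely for illustration.
outcomes = ["does what we want", "schemes", "something else entirely"]

indifferent_prior = {o: 1 / len(outcomes) for o in outcomes}

# A reasoner who knows something about how training shapes the model
# is free to hold a very different, non-uniform prior. Indifference was
# never a claim that the process itself is random.
informed_prior = {
    "does what we want": 0.80,
    "schemes": 0.05,
    "something else entirely": 0.15,
}

print(indifferent_prior)  # roughly {'does what we want': 0.333, 'schemes': 0.333, ...}
print(informed_prior)
```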
The section where she talks about how evolution’s goals are human abstractions while an LLM’s training has a well-specified goal in terms of gradient descent is really where that argument loses me, though. In both cases, it’s still not well specified, a priori, how the well-defined processes cash out in terms of real-world behavior. The factual claims are true enough, sure. But the thing an LLM is trained to do is predict what comes next, based on training data curated by humans, and humans do scheme. Therefore, a sufficiently powerful LLM should, by default, know how to scheme, and we should assume there are prompts out there in prompt-space that will call forth that capability. No counting argument needed. In fact, the article specifically acknowledges this, saying the training process is “producing systems that behave the right way in all scenarios they are likely to encounter,” which means the behavior is unspecified in whatever scenarios the training process deems “unlikely,” though I’m unclear what “unlikely” even means here or how it’s defined.
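For concreteness, the “well-specified goal” being pointed at is just the next-token prediction objective that gradient descent minimizes, something like the standard cross-entropy loss (my notation, not the article’s):

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t} \log p_\theta(x_t \mid x_{<t})\right]$$

Nothing in that expression mentions real-world behavior; it only says “assign high probability to the curated training data,” which is exactly why the behavioral question stays open.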
One of the things we want from our training process is for scheming behavior not to get called up across a hard-to-define-in-advance set of likely and unlikely cases. In that sense, inner alignment may not be a thing for the structure of LLMs, in that the LLM will automatically want what it is trained to want. But it is still the case that we don’t know how to do outer alignment for a sufficiently general set of likely scenarios, i.e. we don’t actually know precisely what behavioral responses our training process is instilling.
Chris’ latest reply to my other comment resolved a confusion I had, so I now realize my comment above isn’t actually talking about the same thing as yours.