“by effectively generating such datasets, either for specific skills or for everything all at once”
Just to be clear, what you have in mind is something to the effect of chain-of-thought (where LLMs and people deliberate through problems instead of trying to get an answer immediately or in the next few tokens), but in a more roundabout fashion, where you make the LLM deliberate a lot and fine-tune the LLM on that deliberation so that its “in the moment” (aka next token) response is more accurate—is that right?
If so, how would you correct for the hallucinatory nature of LLMs? Do they even need to be corrected for?
Since this is a capabilities-only discussion, feel free to either not respond or take it private. I just found your claim interesting since this is the first time I encountered such an idea.
Chain-of-thought for particular skills, with corrections of mistakes, to produce more reliable/appropriate chains-of-thought where it’s necessary to take many steps, and to arrive at the answer immediately where it’s possible to form the intuition for doing so. Basically doing your homework, for any topic where you are ready to find or make up and solve exercises, with some correction-of-mistakes and guessed-correctly-but-checked-just-in-case overhead, for as many exercises as it takes. The result is a dataset with enough worked exercises, presented in a form that lets SSL (self-supervised learning) extract the skill of doing that thing more reliably, and calibrate how much chain-of-thought a given thing needs before it gets done correctly.
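For concreteness, here is a minimal sketch of that homework loop, with the LLM call and the answer checker left as hypothetical callables (`generate` and `check` are placeholders, not any particular API; the checker could be a calculator, unit tests, or a grader model):

```python
from typing import Callable, Dict, List


def build_worked_exercises(
    generate: Callable[[str], str],     # prompt -> model output (placeholder for whatever LLM is used)
    check: Callable[[str, str], bool],  # (exercise, solution) -> passed? (calculator, tests, grader...)
    exercises: List[str],
    max_attempts: int = 3,
) -> List[Dict[str, str]]:
    """Collect (exercise, chain-of-thought solution) pairs, keeping only
    solutions that pass the check, and feeding failed attempts back in
    as the correction-of-mistakes overhead."""
    dataset: List[Dict[str, str]] = []
    for exercise in exercises:
        prompt = f"Solve step by step, then state the final answer.\n\n{exercise}"
        for _ in range(max_attempts):
            solution = generate(prompt)
            if check(exercise, solution):
                dataset.append({"prompt": exercise, "completion": solution})
                break
            # Show the model its own failed attempt and ask it to find the mistake.
            prompt = (
                "The following attempt at this exercise was wrong. "
                "Find the mistake and redo it step by step.\n\n"
                f"Exercise: {exercise}\n\nFailed attempt:\n{solution}"
            )
    return dataset
```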
A sufficiently intelligent and coherent LLM character that doesn’t yet have a particular skill would be able to follow the instructions and complete such tasks for arbitrary skills it’s ready to study. I’m guessing ChatGPT is already good enough for that, but Bing Chat shows that it could become even better without new developments. Eventually there is a “ChatGPT, study linear algebra” routine that produces a ChatGPT that can do linear algebra (or a dataset for a pretrained GPT-N to learn linear algebra out of the box), after expending some nontrivial amount of time and compute, but crucially without any other human input/effort. And the same routine works for all other topics, not just linear algebra, provided they are not too advanced for the current model to study.
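A hedged sketch of that routine, reusing `build_worked_exercises` from the sketch above; `finetune` is a stand-in for whatever fine-tuning entry point the stack provides, not a real API, and the exercise-generation prompt is made up for illustration:

```python
import json
from typing import Callable


def study_topic(
    topic: str,
    generate: Callable[[str], str],
    check: Callable[[str, str], bool],
    finetune: Callable[[str], None],
    n_exercises: int = 1000,
    out_path: str = "worked_exercises.jsonl",
) -> None:
    """'ChatGPT, study <topic>': have the model make up its own homework,
    solve it with checked chains of thought, and fine-tune on the result."""
    # The model invents its own exercises for the topic.
    exercises = [
        generate(
            f"Invent one self-contained exercise on {topic} (exercise #{i}), "
            "with a well-defined answer."
        )
        for i in range(n_exercises)
    ]
    dataset = build_worked_exercises(generate, check, exercises)  # from the sketch above
    with open(out_path, "w") as f:
        for example in dataset:
            f.write(json.dumps(example) + "\n")
    finetune(out_path)  # e.g. hand the JSONL to whatever tuning pipeline exists
```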
So this is nothing any high schooler isn’t aware of, not much of a capability discussion. There are variants that look different and are likely more compute-efficient, or give other benefits at the expense of more misalignment risk (because they involve data further from human experience, they might produce something that’s less of a human imitation); this is just the obvious upper-bound-on-difficulty variant.
But also, this is the sort of capability idea that doesn’t destroy the property of LLM characters being human imitations, and more time doesn’t just help with alignment, but also with unalignable AGIs. LLM characters with humane personality are the only plausible-in-practice way I’m aware of to produce direct, if not transitive, alignment. Something with the same alignment shortcomings as humans, but sufficiently different that it might still change things for the better.