Steganography could be used by AIs within a single short chain-of-thought to gain more serial thinking time. This could include “thoughts” about the overseer and how to play the training game correctly. If this behavior is incentivized, it would make deceptive alignment more likely, since it would remove some of the pressure against deceptive alignment coming from speed priors.
I mean, IIUC the speed prior still cuts against this, since instead of thinking about training, the network could just be allocating more capacity to doing the actual task it’s trained to do. That doesn’t seem to change with additional filler thinking tokens.
> if planning, thinking, and communications in dense vector spaces become vastly more powerful than text-based planning, thinking, and communication, then the alignment tax from sticking with systems mostly relying on text will grow.
Doesn’t this already apply partially to the current work? Sure, you constrain the input embeddings to the valid token embedding vectors, but the model also attends to the previous residual streams (and not just the tokens they output).
> I mean, IIUC the speed prior still cuts against this, since instead of thinking about training, the network could just be allocating more capacity to doing the actual task it’s trained to do. That doesn’t seem to change with additional filler thinking tokens.
Agreed that the speed prior still pushes against using a constant fraction of tokens to think about irrelevant stuff.
However, I stand by “it would remove some of the pressure against deceptive alignment coming from speed priors”. In particular, pure forward-pass reasoning has capped serial reasoning time with current architectures, while token-based reasoning doesn’t, which seems potentially important depending on various other considerations.
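As a back-of-the-envelope sketch of the asymmetry (the symbols $L$ and $T$ are illustrative, not from the discussion above): a single forward pass of an $L$-layer transformer admits at most on the order of $L$ sequentially dependent computation steps, whereas generating $T$ chain-of-thought tokens lets each pass condition on everything computed for earlier tokens, so the serial budget scales with output length:

$$\underbrace{O(L)}_{\text{single forward pass}} \quad \text{vs.} \quad \underbrace{O(L \cdot T)}_{T \text{ generated tokens}}$$

That is, forward-pass serial depth is fixed by the architecture, while chain-of-thought serial depth grows with every extra token.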
> Doesn’t this already apply partially to the current work? Sure, you constrain the input embeddings to the valid token embedding vectors, but the model also attends to the previous residual streams (and not just the tokens they output).
As in “current transformers already think in dense vector spaces”? The main thing is that the serial cognition time prior to some text bottleneck is bounded. I totally agree that transformers do a bunch of unseen cognition in a forward pass, but it seems possible to constrain/bound this within-forward-pass cognition in a bunch of useful ways. We probably can’t importantly constrain the overall cognitive labor an AI lab needs to do if that cognitive labor were done in an entirely uninterpretable way.
(Let me know if this isn’t what you were trying to talk about; I’m not super confident what you’re pointing at here.)