In a causal-masked transformer, each attention layer can query the previous layer’s activations at any non-masked column in the context window. Gradients flow through these attention connections, so the activations at earlier columns are optimized not just to improve prediction accuracy for their own next token, but also to produce values that are useful for future columns to attend to when predicting theirs.
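A minimal single-head sketch of this in JAX (the shapes, names, and toy gradient check are my own assumptions, not code from any particular model): position t’s output mixes values from all columns ≤ t, so differentiating a later position’s output gives nonzero gradients at earlier columns.

```python
import jax
import jax.numpy as jnp

def causal_attention(x, Wq, Wk, Wv):
    # x: (T, d) activations from the previous layer, one "column" per position
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv                 # (T, d) each
    scores = q @ k.T / jnp.sqrt(d)                   # (T, T) attention logits
    mask = jnp.tril(jnp.ones((T, T), dtype=bool))    # column j is visible to row i iff j <= i
    weights = jax.nn.softmax(jnp.where(mask, scores, -jnp.inf), axis=-1)
    return weights @ v                               # row i mixes values from columns <= i

# Toy check: the gradient of the last position's output w.r.t. the input is
# generally nonzero at the earliest column, i.e. earlier activations receive
# training signal from later predictions.
keys = jax.random.split(jax.random.PRNGKey(0), 4)
T, d = 5, 8
x  = jax.random.normal(keys[0], (T, d))
Wq = jax.random.normal(keys[1], (d, d))
Wk = jax.random.normal(keys[2], (d, d))
Wv = jax.random.normal(keys[3], (d, d))
g = jax.grad(lambda x: jnp.sum(causal_attention(x, Wq, Wk, Wv)[-1]))(x)
print(g[0])
```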
I think this is part of the reason why prompt engineering is so fiddly.
GPT essentially does a limited form of branch prediction and speculative execution. It guesses (based on the tokens evaluated so far) what pre-computation will be useful for future token predictions. If its guess is wrong, the pre-computation will be useless.
A prompt lets you sharpen the superposition of simulacra before the model gets to the input text, improving the quality of the branch prediction. However, the exact way a prompt narrows down the simulacra can be pretty arbitrary, so it takes a lot of random experimentation to get right.
Ideally, by the end of the prompt the implicit superposition of simulacra should match the probability distribution over simulacra that you expect to have generated your input text. The better the match, the more accurate the branch prediction and speculative execution will be.
But you can’t explicitly control the superposition and you don’t really know the distribution of your input text so… fiddly.
It is possible to modify the transformer architecture to enforce value (prediction accuracy) myopia by placing stop gradients in the attention layers. This effectively prevents past activations from being directly optimized to be more useful for future computation.
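As a rough illustration, here is one way such a constraint could look (a minimal JAX sketch; the exact placement of the stop-gradients, and all names, are my assumptions rather than a specific proposal from the thread): each column attends to itself through gradient-carrying keys and values, but sees every earlier column only through `jax.lax.stop_gradient` copies, so past activations receive no training signal from later predictions.

```python
import jax
import jax.numpy as jnp

def myopic_causal_attention(x, Wq, Wk, Wv):
    # x: (T, d) activations from the previous layer
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_sg, v_sg = jax.lax.stop_gradient(k), jax.lax.stop_gradient(v)
    # Each column keeps gradient-carrying keys/values only when attending to
    # itself; every earlier column is seen through a stop_gradient copy, so
    # past activations are never optimized to be useful for later predictions.
    self_mask = jnp.eye(T, dtype=bool)[:, :, None]                  # (T, T, 1)
    k_mix = jnp.where(self_mask, k[None, :, :], k_sg[None, :, :])   # (T, T, d)
    v_mix = jnp.where(self_mask, v[None, :, :], v_sg[None, :, :])
    scores = jnp.einsum('id,ijd->ij', q, k_mix) / jnp.sqrt(d)
    causal = jnp.tril(jnp.ones((T, T), dtype=bool))
    weights = jax.nn.softmax(jnp.where(causal, scores, -jnp.inf), axis=-1)
    return jnp.einsum('ij,ijd->id', weights, v_mix)
```

Re-running the toy gradient check from the earlier sketch with this variant gives exactly zero gradient at the early columns for the last position’s output: through this layer, they no longer get optimized for future computation.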
I think that enforcing this constraint might make interpretability easier. The pre-computation that transformers do is indirect, limited, and strange. Each column only has access to the non-masked columns of the previous residual block, rather than to the non-masked columns of all residual blocks, or even just of all previous residual blocks.
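To make that access pattern concrete (a toy sketch in the same style as above; the structure is standard, the names are mine): each block reads only the accumulated residual stream left by the block before it, never the separate outputs of every earlier block.

```python
def residual_stack(h, blocks):
    # h: (T, d) residual stream after embeddings ("residual block 0")
    for block in blocks:
        # Each block (attention or MLP) reads only the accumulated stream h
        # produced by the previous block, not the individual outputs of every
        # earlier block; within attention, the causal mask further restricts
        # each column to the non-masked columns of that stream.
        h = h + block(h)
    return h
```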
Maybe RNNs like RWKV with full hidden state access are easier to interpret?
A consequence-blind simulator that predicts power-seeking agents (like humans) will still predict actions which seek power, but those actions will seek power for the simulated agent, not for the simulator itself. I usually think about problems like this as simulator vs simulacra alignment. If you successfully build an inner-aligned simulator, you can use it to faithfully simulate according to the rules it learns and generalizes from its training distribution. However, you are still left with the problem of extracting consistently aligned simulacra.
This is concerning because it’s not at all clear what a model that is predicting itself should output. It breaks many of the intuitions about why it should be safe to use LLMs as simulators of text distributions.
Doesn’t Anthropic’s Constitutional AI approach do something similar? They might be familiar with the consequences from their work on Claude.
Agreed. Gwern’s short story “It Looks Like You’re Trying To Take Over The World” sketches a takeover scenario by a simulacrum.