I agree, I just think that virtually all of the ‘big’ issues talked about are probably not possible with current models, including mesa optimizers. Architecturally, they may not be achievable in the search space of “find the function parameters that minimize error on <this enormous amount of text, or this enormous amount of robotics problems>”.
Deception theoretically has a cost, and the direction of optimization would push against it: you’re asking for the smallest representation that correctly predicts the output. So at least with these forms of training + architectures (transformer variants, both for LLMs and robotics), this particular flaw May. Not. Happen.
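To make that concrete, here is a minimal sketch of the outer objective being described (generic PyTorch-style toy code with made-up sizes, not any real training setup): the only signal the optimizer ever receives is next-token prediction error on the text.

```python
# Toy sketch of the outer objective: the loss only scores next-token
# prediction error on the training text. Nothing in it represents or rewards
# internal goals, agents, or deception; anything like that would have to
# appear inside the weights "for free", which is what's being doubted above.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch = 100, 64, 32, 8

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(encoder.parameters()) + list(head.parameters())
opt = torch.optim.AdamW(params, lr=1e-3)

tokens = torch.randint(0, vocab_size, (batch, seq_len + 1))  # stand-in for the enormous corpus
inputs, targets = tokens[:, :-1], tokens[:, 1:]

causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
logits = head(encoder(embed(inputs), mask=causal_mask))      # (batch, seq_len, vocab)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
opt.step()  # the update direction is purely "predict the corpus better"
```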
It’s precisely what you were saying with your example: the actual compiler flaws are both different and, as it turns out, way worse. (“Sydney” wasn’t a mesa optimizer; it was channeling a character that exists somewhere in the training corpus. The model was Working As Intended.)
Didn’t they demonstrate that transformers could be mesa-optimizers? (I never properly understood the paper, so it’s a genuine question.) “Uncovering Mesaoptimization Algorithms in Transformers”
From the paper:

“Motivated by our findings that attention layers are attempting to implicitly optimize internal objective functions, we introduce the mesa-layer, a novel attention layer that efficiently solves a least-squares optimization problem, instead of taking just a single gradient step towards an optimum. We show that a single mesa-layer outperforms deep linear and softmax self-attention Transformers on simple sequential tasks while offering more interpretability.”
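As far as I can tell (this is just my reading of the abstract, with my own toy dimensions and variable names, not the paper’s implementation), the distinction is roughly this: plain linear self-attention amounts to a single gradient step on an internal least-squares problem over the context, while the mesa-layer solves that inner problem in closed form.

```python
# Toy illustration of the quoted distinction, not the paper's actual code.
# Inner problem: fit W so that W @ key_i ≈ value_i for the tokens in context,
# then apply the fitted W to the current query.
import numpy as np

rng = np.random.default_rng(0)
d, t = 16, 32                    # feature dim, number of context tokens
K = rng.normal(size=(t, d))      # keys   = inputs of the inner regression
V = rng.normal(size=(t, d))      # values = targets of the inner regression
q = rng.normal(size=(d,))        # current query
lam, lr = 1e-2, 0.1              # ridge strength, inner learning rate

# (a) one gradient step on ||K @ W.T - V||^2 starting from W = 0:
#     W_1 is proportional to V.T @ K, so W_1 @ q = sum_i v_i * (k_i . q),
#     i.e. ordinary (unnormalized) linear attention.
W_one_step = lr * V.T @ K
out_linear_attention = W_one_step @ q

# (b) the mesa-layer idea: solve the ridge least-squares problem exactly,
#     W* = V.T @ K @ (K.T @ K + lam * I)^-1, and apply W* to the query.
W_star = V.T @ K @ np.linalg.inv(K.T @ K + lam * np.eye(d))
out_mesa = W_star @ q

print(out_linear_attention[:3])
print(out_mesa[:3])
```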
It looks like you can analyze transformers, discover the internal patterns that form emergently, work out which ones work best, and then redesign your network architecture to start with an extra layer that has this pattern already present, as in the sketch below.
Not only is this closer to the human brain, but yes, it’s adding a type of internal mesa optimizer. Doing it deliberately, instead of letting one form emergently from the data, probably prevents the failure mode AI doomers are worried about: this layer allowing the machine to defect against humans.
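For what it’s worth, here is a rough, purely illustrative sketch of what “start with an extra layer that has this pattern already present” could look like: a hand-designed least-squares layer at the bottom of an otherwise ordinary transformer stack. The layer design and names are mine, not the paper’s architecture, and causality is ignored for brevity.

```python
# Hypothetical sketch: build the discovered "solve a least-squares problem"
# pattern into the first layer of the stack, rather than waiting for it to
# emerge from training. Purely illustrative, not the paper's architecture.
import torch
import torch.nn as nn

class LeastSquaresLayer(nn.Module):
    """Fits a ridge regression from keys to values over the whole sequence,
    then maps each position's query through the fitted solution."""
    def __init__(self, d_model: int, lam: float = 1e-2):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.lam = lam

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        d = x.shape[-1]
        eye = self.lam * torch.eye(d, device=x.device)
        # W* = V^T K (K^T K + lam I)^-1, solved per batch element
        w = v.transpose(1, 2) @ k @ torch.linalg.inv(k.transpose(1, 2) @ k + eye)
        return x + q @ w.transpose(1, 2)   # residual connection

class HybridModel(nn.Module):
    def __init__(self, vocab: int = 100, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.mesa = LeastSquaresLayer(d_model)          # pattern built in up front
        self.rest = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.rest(self.mesa(self.embed(tokens))))

logits = HybridModel()(torch.randint(0, 100, (2, 16)))  # -> (2, 16, 100)
```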