Thanks for the comment, I’m glad it helped!
I’m not sure if I know exactly what parts you feel fuzzy on, but some scattered thoughts:
Abstracting over a lot of nuance and complexity, one could model internal optimization as a ~general-purpose search process / module that the model can make use of. A general-purpose search process requires a goal with which to evaluate the consequences of the different plans it searches over. This goal is fed into the search module as an input.
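To make that picture a bit more concrete, here is a minimal sketch of the separation (every name below is invented for illustration, not anyone’s actual architecture): the search procedure is fixed, and the goal arrives as an ordinary argument that scores predicted outcomes.

```python
from typing import Callable, Iterable, List, Tuple

Plan = List[str]        # a plan here is just a sequence of action names
WorldState = dict       # toy stand-in for the model's predicted outcome of a plan

def predict_outcome(plan: Plan) -> WorldState:
    """Toy world model: pretend each action adds one unit of 'progress'."""
    return {"progress": len(plan)}

def general_purpose_search(
    candidate_plans: Iterable[Plan],
    goal: Callable[[WorldState], float],  # the goal is an input, not baked into the search
) -> Tuple[Plan, float]:
    """Score every candidate plan under the supplied goal and return the best one."""
    best_plan, best_score = None, float("-inf")
    for plan in candidate_plans:
        score = goal(predict_outcome(plan))
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score

# The same fixed search module pursues different targets just by swapping its input:
plans = [["a"], ["a", "b"], ["a", "b", "c"]]
print(general_purpose_search(plans, goal=lambda state: state["progress"]))   # longest plan wins
print(general_purpose_search(plans, goal=lambda state: -state["progress"]))  # shortest plan wins
```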
This input is probably described in the model’s internal language; i.e., it’s described in terms of concepts the model has learned that correspond to things in the environment. This seems true even if the model uses some very direct pointer to things in the environment: the pointer still has to be represented as information that makes sense to the search process, which is written in the model’s ontology.
So the inputs to the search process are part of the system itself. Which is to say: the “property” of the optimization that corresponds to what it’s targeted at is in the complexity class of the system that the optimization is internal to. I think this generalizes to the case where the optimization isn’t as cleanly represented as a general-purpose search module.
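A toy continuation of the same sketch for the ontology point (again, the `encode` step and the concept names are hypothetical, purely for illustration): the goal isn’t a bare pointer at raw environment states; it’s a function over the model’s own learned concepts, so the thing that fixes what the search is aimed at lives inside the system.

```python
import numpy as np

def encode(raw_observation: np.ndarray) -> dict:
    """Stand-in for the model's learned ontology: raw observations in, internal concepts out.
    The concept names here are invented for illustration."""
    return {
        "diamond_nearby": float(raw_observation.mean() > 0.5),
        "energy_used": float(raw_observation.sum()),
    }

def goal_in_model_ontology(concepts: dict) -> float:
    """The goal only 'sees' the model's own concepts, so what it points at is
    fixed by how the model itself carves up the environment."""
    return concepts["diamond_nearby"] - 0.01 * concepts["energy_used"]

# The search module from the previous sketch could consume this goal unchanged,
# as long as its predicted outcomes are expressed in the same internal concepts.
predicted_concepts = encode(np.random.rand(8, 8))
print(goal_in_model_ontology(predicted_concepts))
```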
Thank you for your answer! It cleared up my confusion!
I would be interested to know more about your concept of (inner) optimization in its full complexity and nuance. I would really appreciate it if you could point me to any previous writings regarding this.
My previous reading on this topic includes this post from Yudkowsky and this post from Flint, where (to the best of my understanding) an optimizing system evolves toward states that rank high in some preference ordering and would have a low probability of arising spontaneously. I find their definitions to be a bit more general than the one you are referring to here (please correct me if I am wrong).
I am curious about the above because I am currently working on a project related to this topic. I am interested in formalizing, with rigorous math, some concepts regarding optimizers and their potential evolution toward agentic structure in some limit.