Three thoughts:
If you set up the system like that, you may run into the mentioned problems. It might be possible to wrap both into a single model that is trained together.
An advanced system may reason about the joint effect, e.g. by employing fixed-point theorems and Logical Induction.
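To make the fixed-point idea concrete, here is a minimal toy sketch (my own illustration, not from any of the cited work): a prediction that influences the outcome it predicts, resolved by iterating until the prediction is self-consistent. The `outcome` function and its constants are hypothetical.

```python
def outcome(prediction: float) -> float:
    # Hypothetical environment: announcing `prediction` shifts the
    # actual outcome partway toward it (a self-fulfilling pressure).
    base = 0.2
    influence = 0.5
    return base + influence * prediction

def fixed_point(f, x0=0.0, tol=1e-9, max_iter=1000):
    """Iterate x <- f(x) until it stops changing.

    Converges when f is a contraction; a self-consistent
    prediction p satisfies p == f(p).
    """
    x = x0
    for _ in range(max_iter):
        nx = f(x)
        if abs(nx - x) < tol:
            return nx
        x = nx
    raise RuntimeError("fixed-point iteration did not converge")

p = fixed_point(outcome)
# Self-consistency: p == outcome(p), i.e. p = 0.2 / (1 - 0.5) = 0.4
```

Reasoning about the joint effect means finding the prediction that remains correct once its own influence on the world is taken into account, which is exactly this fixed point.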
Steven Byrnes’s [Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL models humans as having three components:
a world model that is mainly trained by prediction error
a steering system that encodes preferences over world states
a system that learns how world model predictions relate to steering system feedback
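The three components above can be sketched as a toy program (my own illustrative framing, not code from Byrnes's post; the dynamics, learning rates, and the `steering_system` preference are all assumptions):

```python
import random

random.seed(0)  # reproducible toy run

class WorldModel:
    """Component 1: predicts the next state; trained by prediction error."""
    def __init__(self):
        self.weight = 0.0  # guesses next_state = weight * state

    def predict(self, state):
        return self.weight * state

    def update(self, state, next_state, lr=0.1):
        error = next_state - self.predict(state)  # prediction error
        self.weight += lr * error * state
        return error

def steering_system(state):
    """Component 2: innate preferences over world states (fixed, not learned)."""
    return -abs(state - 1.0)  # prefers states near 1.0

class ValueLearner:
    """Component 3: learns how world-model predictions relate
    to steering-system feedback."""
    def __init__(self):
        self.value = {}

    def update(self, predicted_state, feedback, lr=0.2):
        key = round(predicted_state, 1)  # coarse bucketing of predictions
        old = self.value.get(key, 0.0)
        self.value[key] = old + lr * (feedback - old)

wm, vl = WorldModel(), ValueLearner()
for _ in range(200):
    s = random.uniform(0.0, 2.0)
    true_next = 0.8 * s            # hidden dynamics the world model must learn
    pred = wm.predict(s)
    wm.update(s, true_next)        # trained only on prediction error
    vl.update(pred, steering_system(true_next))  # links predictions to feedback
```

Note that only the third component ever sees steering feedback; the world model is trained purely on prediction error, which is the separation that makes the joint training of all three an interesting design question.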