Reflective stability does seem like the right term. Searches on that term are turning up some relevant discussion on the Alignment Forum, so thanks!
Tiling agent theory is about formal proofs of goal consistency in successor agents. I don't think that's relevant for any AGI made of neural networks similar to either brains or current systems. And that's a problem.
Reflective consistency looks to be about decision algorithms given beliefs, so I don't think that's directly relevant. I couldn't work out Yudkowsky's use of "reflectively coherent quantified belief" on a quick look, but it's in service of that closed-form proof. That term only occurs three times on AF. Reflective trust is about internal consistency and decision processes relative to beliefs and goals, and it also doesn't seem to have caught on as common terminology.
So reflective stability is the term I'm looking for, and it should turn up more related work. Thanks!