Thane Ruthenis comments on Corrigibility Via Thought-Process Deference

Thane Ruthenis 28 Nov 2022 19:01 UTC
LW: 1 AF: 1
I’d rather sacrifice even more corrigibility properties (like how this already isn’t too worried about subagent stability) for better friendliness
Do you have anything specific in mind?
- Charlie Steiner 29 Nov 2022 6:09 UTC
  LW: 2 AF: 1
  One thing might be that I’d rather have an AI design that’s more naturally self-reflective, i.e. using its whole model to reason about itself, rather than having pieces that we’ve manually retargeted to think about some other pieces. This reduces how much Cartesian doubt is happening on the object level all at the same time, which sorta takes the AI farther away from the spec. But this may not be a great example, since it might be more about not endorsing the “retargeting the search” agenda.