It makes it easier, but consider this: the human brain also does this. When we conform to expectations, we model ourselves and thereby make ourselves more predictable. Yet that doesn't prevent deception; people still lie, and some of the deception simply gets pushed into the subconscious.
Sure, it doesn't prevent a deceptive model from being made, but if AI engineers built neural networks with this kind of self-awareness at all levels from the ground up, that wouldn't happen in their models. The encouraging thing, if it holds up, is that there is little to no "alignment tax" for making the models understandable: they are also better at their task.
Indeed, engineering readability at multiple levels may solve this.
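One concrete way to picture "self-awareness at all levels" is an auxiliary self-modeling objective applied at every hidden layer: the network is trained not only on its task but also to predict its own internal activations. The sketch below is a minimal, hypothetical PyTorch variant of that idea; the architecture, the per-layer prediction heads, and the `lam` weight are all my assumptions for illustration, not a specification from the discussion above. The key design choice is that the activation targets are not detached, so the gradient from the self-modeling loss also flows into the activations themselves, pressuring the network to keep its own states predictable (and, per the claim above, more readable) rather than merely training a passive observer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingMLP(nn.Module):
    """Classifier with a self-modeling head at every depth (hypothetical
    multi-level variant): each layer must predict the previous layer's
    activations, alongside the ordinary classification objective."""

    def __init__(self, in_dim=784, hidden=128, n_classes=10, depth=3):
        super().__init__()
        dims = [in_dim] + [hidden] * depth
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(depth)])
        # One self-prediction head per pair of adjacent hidden layers:
        # reconstructs layer k's activation from layer k+1's activation.
        self.self_heads = nn.ModuleList(
            [nn.Linear(hidden, hidden) for _ in range(depth - 1)])
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):
        acts = []
        h = x
        for layer in self.layers:
            h = F.relu(layer(h))
            acts.append(h)
        logits = self.classifier(h)
        # Targets are deliberately NOT detached: the MSE gradient flows
        # into the predicted activations too, so the network is pushed
        # to make its own hidden states easy to predict.
        self_loss = sum(
            F.mse_loss(head(acts[k + 1]), acts[k])
            for k, head in enumerate(self.self_heads))
        return logits, self_loss

# Usage: combined objective = task loss + weighted self-modeling loss.
model = SelfModelingMLP()
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
logits, self_loss = model(x)
lam = 0.1  # self-modeling weight; a free hyperparameter, assumed here
loss = F.cross_entropy(logits, y) + lam * self_loss
loss.backward()
```

If the "no alignment tax" observation holds, `lam` would act less like a penalty traded off against accuracy and more like a regularizer, which is exactly what would make building readability in at every level cheap.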