Very interesting. I’m stuck on the argument that truthfulness is hard because the concept of truth is somehow fraught or too complicated. I’m envisaging an objection based on the T-schema (‘<p> is true iff p’).
Nate writes:
Now, in real life, building a truthful AGI is much harder than building a diamond optimizer, because ‘truth’ is a concept that’s much more fraught than ‘diamond’. (To see this, observe that the definition of “truth” routes through tricky concepts like “ways the AI communicated with the operators” and “the mental state of the operators”, and involves grappling with tricky questions like “what ways of translating the AI’s foreign concepts into human concepts count as manipulative?” and “what can be honestly elided?”, and so on, whereas diamond is just carbon atoms bound covalently in tetrahedral lattices.)
(end of quote)
But this reference to “the definition of ‘truth’” seems to presuppose some view of what truth is. I’m not sure which view it is, but whatever it is, it’s bound to be philosophically controversial.
Some think that ‘true’ can be defined by taking all the instances of the T-schema, or a (perhaps restricted) universal generalisation of it.
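For concreteness, here is that spelled out: one biconditional per sentence, plus the quantified generalisation some deflationists endorse (read the quantifier as ranging over sentences or propositions, which is itself a contested move). This is just the textbook formulation; nothing here is specific to Nate’s post.

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{amssymb} % for the corner quotes \ulcorner, \urcorner
\begin{document}
% One instance of the T-schema for each sentence p:
\[ T(\ulcorner p \urcorner) \leftrightarrow p \]
% The universal generalisation (quantifying into sentence position):
\[ \forall p \, \bigl( T(\ulcorner p \urcorner) \leftrightarrow p \bigr) \]
\end{document}
```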
And this seems not totally crazy or irrelevant from an AI design perspective, at least at first blush. I feel I can sort of imagine an AI obeying a rule which says to assert <p> only if p.
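To make the rule I’m imagining concrete, here is a toy sketch in Python. Everything in it (the beliefs table, the function names) is invented for illustration; no real system has anything this simple.

```python
# Toy sketch of a T-schema-style assertion gate (illustrative only).
# `beliefs` stands in for the AI's world model; the names here are
# hypothetical, not drawn from any actual alignment proposal.

beliefs = {
    "the reactor is offline": True,
    "the diamond is in vault 3": False,
}

def may_assert(sentence: str) -> bool:
    """Permit asserting <p> only if the system takes p to hold."""
    return beliefs.get(sentence, False)

def speak(sentence: str) -> None:
    print(sentence if may_assert(sentence) else "(stays silent)")

speak("the reactor is offline")     # asserted
speak("the diamond is in vault 3")  # withheld
```

The gap, of course, is that `may_assert` checks what the system believes, not what is true: “assert <p> only if p” has quietly become “assert <p> only if you take p to hold”, and perhaps that is one door through which Nate’s worries re-enter.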
Trying to envisage problems and responses, I hit on the idea that the AI would have degrees of belief or credences, rather than simply a list of things it holds true simpliciter. But perhaps it can have both. And perhaps obeying the T-schema-based truthfulness rule would just lead it to confine most of its statements to claims about its own credences, or something like that.
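The credence move can be made concrete the same way. In this variant (again, purely hypothetical scaffolding), the system’s outward statements are about its own credences, which are sentences it can check by introspection, so the T-schema rule is comparatively easy to satisfy on them, assuming an honest read-off of its own state:

```python
# Toy sketch: report credences rather than flatly asserting p.
# "My credence that p is c" is a sentence the system can verify
# against its own state, so the truthfulness rule applies cleanly.

credences = {
    "the reactor is offline": 0.97,
    "the diamond is in vault 3": 0.12,
}

def report(sentence: str) -> str:
    c = credences.get(sentence, 0.5)  # illustrative ignorance prior
    return f"My credence that {sentence} is {c:.2f}."

print(report("the reactor is offline"))
print(report("the diamond is in vault 3"))
```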
I think I see a separate problem about ensuring the AI does not (modify itself in order to) violate the T-schema-based truthfulness rule. But that seems different, at least, from the supposed problem in the OP about the definition of ‘true’ being fraught or complicated.
If it wasn’t already clear: I’m a philosophy person, not an alignment expert, though I follow alignment with some interest.