On your first point, correct: the thing shown to be uncomputable is testing alignment. And yes, uncomputability is a worst-case claim. Would it be clearer to call the paper an uncomputable alignment TEST as opposed to an uncomputable alignment PROBLEM? (I’m considering editing the paper before submitting it to a journal.)
Deterring a few would be nice. More realistically, proofs in this vein could help convince regulators to ignore opaque-box makers’ claims about detecting an agent’s alignment.
I think that would help. I think the existing title primed me to expect something else, more along the lines of it being impossible for an “aligned” program to exist because it couldn’t figure out what to do.
Or perhaps the direct-statement style: “Aligned status of software is undecidable” or something like that.
Thanks for the feedback!