Thanks. A few thoughts:
It is almost certainly too long. Could use editing/distillation/executive-summary. I erred on the side of leaving more in, since the audience I’m most concerned with are those who’re actively working in this area (though for them there’s a bit much statement-of-the-obvious, I imagine).
I don’t think most of it is new, or news to the authors: they focused on the narrow version for a reason. The only part that could be seen as a direct critique is the downside risks section: I do think their argument is too narrow.
As it relates to Truthful AI, much of the rest can be seen in terms of “Truthfulness amplification doesn’t bridge the gap”. Here again, I doubt the authors would disagree. They never claim that it would, just that it expands the scope—that’s undeniably true.
On being net-positive below a certain threshold, I’d make a few observations:
For the near-term, this post only really argues that the Truthful AI case for positive impact is insufficient (not broad enough). I don’t think I’ve made a strong case that the output would be net negative, just that it’s a plausible outcome (it’d be my bet for most standards in most contexts).
I do think such standards would be useful in some sense for very near future AIs—those that are not capable of hard-to-detect manipulation. However, I’m not sure eliminating falsehoods there is helpful overall: it likely reduces immediate harm a little, but risks giving users the false impression that AIs won’t try to mislead them. If the first misleading AIs are undetectably misleading, that’s not good.
Some of the issues are less clearly applicable in a CAIS-like setup, but others seem pretty fundamental: e.g. that what we care about is something like [change in accuracy of beliefs] not [accuracy of statement]. The “all models are wrong” issue doesn’t go away. If you’re making determinations in the wrong language game, you’re going to make errors.
Worth emphasizing that “...and this path requires something like intent alignment” isn’t really a critique. That’s the element of Truthfulness research I think could be promising—looking at concepts in the vicinity of intent alignment from another angle. I just don’t expect standards that fall short of this to do much that’s useful, or to shed much light on the fundamentals.
...but I may be wrong!