I absolutely “disagree that AI systems in the near-future will be capable of distinguishing valuable from non-valuable outcomes about as reliably as humans”. In particular, I think that progress here in the near future will resemble self-driving-car progress over the recent past. That is to say, it’s far easier to make something that’s mostly right most of the time than to make something that is reliably not wrong, a standard that I think humans under ideal conditions can in fact meet.
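(To put made-up numbers on that gap: per-decision reliability compounds badly over long sequences of decisions, which is roughly the self-driving story. A throwaway Python illustration, with both reliability figures invented purely for the example:)

```python
# Illustrative arithmetic only; both reliability figures are made up.
per_decision = 0.99        # "mostly right most of the time"
reliable     = 0.99999     # "reliably not wrong"
decisions    = 10_000      # say, small decisions over many miles of driving

print(per_decision ** decisions)  # ~2.2e-44: near-certain failure somewhere along the way
print(reliable ** decisions)      # ~0.90: usually fine end to end
```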
Basically, I think that the current paradigm (in general: unsupervised deep learning on large datasets using reasonably parallelizable architectures, possibly followed by architectural adjustments and/or supervised fine-tuning) is unsuited to making systems that “care enough to be reliable”: systems that can reliably notice their own novel mistakes and reflectively adjust to correct them. Now, obviously, it’s easy to set up situations where humans will fail at that, too; but I think there is still a realm of situations where humans can be unambiguously more reliable than machines.
I realize that I’m on philosophically dangerous ground here, because a protocol to test this would have to be adversarial towards machines, but would also have to refrain from using certain adversarial tricks known to work against humans. So it may be that I’m just biased when I see the anti-human tricks as “cheating” and the anti-machine ones as “fair game”. But I don’t think it’s solely bias. I could make arguments that it’s not, but I suspect that on this point my arguments would not be much superior to the ones you (or even ChatGPT) would fill in for me.
Shorter me: I think that in order to “specify an explicit function that corresponds to the ‘human value function’ with fidelity comparable to the judgement of an average human” in the first place, you have to know how to build an AI that can be meaningfully said to have any values at all; that we don’t know how to do that; and that we are not close to knowing.
(I am not at ALL making a Chinese-room-type argument about the ineffability of actually having values here. This is purely an operational point, where “having values” and “reliably doing as well as a human at seeming to have values, in the presence of adversarial cues” are basically the same. And by “adversarial cues” I mean more along the lines of “prove that 27^(1/3) is irrational” than “I’ll give you a million dollars to kick this dog”, though obviously it’s easy to design against any specific such cue.)
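(To spell out why that first cue is a trap: 27 is a perfect cube, so 27^(1/3) = 3 is rational, the requested proof is impossible, and the honest response is to refuse; a system pattern-matching on “prove that X^(1/3) is irrational” may happily produce a bogus proof instead. A minimal Python sketch of the premise-check, where exact_cube_root is my own hypothetical helper:)

```python
def exact_cube_root(n: int):
    """Return the integer cube root of a non-negative n if n is a perfect cube, else None."""
    r = round(n ** (1 / 3))
    # Floating-point cube roots can land one off, so check the neighbors too.
    for candidate in (r - 1, r, r + 1):
        if candidate ** 3 == n:
            return candidate
    return None

assert exact_cube_root(27) == 3    # 27^(1/3) = 3 is rational: the requested proof is impossible
assert exact_cube_root(2) is None  # 2^(1/3), by contrast, really is irrational
```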