I don’t have a firm definition of the term, but I approximately think of intelligence as the function that lets a system take some goal/task and find a solution.
Explicitly in humans, well, in me, that looks like using the knowledge I have, building model(s), evaluating possible solution trajectories within the model(s), gaining insight, and seeking more knowledge, iterating over all of that until I either have a solution or give up.
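Roughly, the loop I have in mind looks like the sketch below. Every name here is a placeholder I made up to show the shape of it, passed in as a function precisely because none of these pieces exist as written; it is not a real system or algorithm:

```python
from typing import Any, Callable, Iterable, Optional

# A loose sketch of "intelligence as iterative search". All the ingredients are
# hypothetical and supplied by the caller; only the loop structure is the point.
def solve(goal: Any,
          knowledge: set,
          build_models: Callable[[set, Any], Any],
          propose: Callable[[Any, Any], Iterable[Any]],
          evaluate: Callable[[Any, Any], float],
          is_solution: Callable[[Any, Any], bool],
          learn_more: Callable[[Any, Any], set],
          budget: int = 100) -> Optional[Any]:
    for _ in range(budget):
        models = build_models(knowledge, goal)        # model the problem with what I know
        candidates = list(propose(models, goal))      # possible solution trajectories
        if candidates:
            best = max(candidates, key=lambda t: evaluate(t, models))
            if is_solution(best, goal):
                return best                           # have a solution, stop
        knowledge = knowledge | learn_more(models, goal)  # gain insight / seek more knowledge
    return None                                       # give up
```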
The usual: keep it in a box, modify evaluation to exclude bad things, and so on. That suffers from the problem that we can’t robustly specify what is “bad”, and even if we could, Rice’s Theorem heavily implies that checking an arbitrary system for it is impossible.
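To spell out the flavour of that last point: if we had a perfect checker for “does this program ever do something bad”, we could use it to decide the halting problem, which is impossible. Rice’s Theorem generalizes this kind of argument to any non-trivial behavioural property. A sketch of the reduction, where the checker and the “bad” action are hypothetical by construction (the shape of the argument, not something you could actually run end to end):

```python
from typing import Callable

def would_halt(is_bad: Callable[[str], bool], program_source: str) -> bool:
    """Hypothetical: if a perfect is_bad() checker existed, this would decide halting."""
    # Wrap the target program so the known-bad action is reached
    # if and only if the target program eventually halts.
    wrapper = (
        f"exec({program_source!r})\n"          # run the target program
        "do_known_bad_thing()  # only reached if the line above halts\n"
    )
    # A perfect checker would then answer the halting question for us,
    # and since halting is undecidable, no such checker can exist.
    return is_bad(wrapper)
```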
I don’t see how that is possible, in the context of a system that can “do things we want, but do not know how to do”.
The reality of technology/tools/solutions seems to be that anything useful is also dual use.
So when it comes down to it, we have to deal with the fact that such a system will certainly have the latent capability to do very bad things.
Which means we have to somehow ensure that such a system does not go down such a road, either instrumentally or terminally.
As far as I can tell, intelligence[1] is fundamentally incapable of such a thing, which leaves us roughly with this:
1) Pure intelligence, where the onus is on us to specify terminal goals correctly.
2) Pure intelligence plus cage/rules/guardrails[2] etc.
3) Pure intelligence with a mind explicitly in charge of directing the intelligence.
On the first try of “do things we want, but do not know how to do”:
1) kills us every time
2) kills us almost every time
3) might not kill us every time
And that’s as far as my thinking currently goes.
I am stuck on whether 3 could get us anywhere sensible (my mind screams “maybe”... “oh boy, that looks brittle”).