Things that can go wrong with decomposed modules
Some thoughts on what could go wrong with the decomposed modules that Eric Drexler is currently analysing:
There are systems that are unsafe if:
a) Every module is a superintelligent agent. b) One module is a superintelligent agent. c) No modules are superintelligent agents, but their input-output behaviour is not fully understood. d) No modules are superintelligent agents, but their input-output behaviour are fully understood.
Now, d) seems unrealistic; my main objections are relevant to b), but they can shade into c) too.
Consider a system withe at least two modules, module P which produces plans for bridges, and module C which checks these plans. Because of imperfections in P, it is not only motivated to produce the best bridge plans it can, but it’s also motivated to have C accept the plans.
Except that C is on a timer, and, if it can’t check the plans within a set time, it rejects them. The complicated plans that P is motivated to produce are likely to be rejected on these grounds. Therefore P is not only motivated to produce complicated plans, but to “phrase” them in ways that C can quickly parse. Maybe including executive summaries, or pre-computing part of what C needs to do, and so on.
How would it acquire the ability to do so? Well, if we assume b), then P is already a superintetelligent agent, so that’s easy. If not, it might acquire knowledge of C through testing some plans with it. Or there might be a module C’ in the system, that also checks plans, but in a way that P doesn’t care about. If P has access to C’, it can rapidly test a lot of plans with it, and pass those on to C.
And, of course, there is but one step from “making your plans much easier to check” to “fooling the checker into accepting an imperfect plan”.
A last element that relevant to this is that modules may come to realise the capabilities of other modules, and thus gain them by passing that message on. If module A can demonstrate to module B that a certain message came from module C, then module A effectively has the capabilities of C, and the fact that C and B are not connected ceases to matter.
See this LessWrong thread for another type of problem with some of these adversarial approaches (including the “meeting halfway” proposal).
Roughly the problem is that, if one system is generating a long string and another system is checking it, then it’s easy for the first system to change the string in an important way without the second system noticing (e.g. using steganography). This seems like an important problem to think about in these adversarial-type approaches.