Some comments regarding first uses of AGI outside of the digital realm (or on the internet)
I plan to write about this in more detail, but in the meantime I feel like noting down a few thoughts.
Maybe strategic considerations will speak in favor of haste. But even if a team feels it only has days/weeks/months before it needs to make use of AGI in the non-digital realm (or on the internet), there could still be strong reasons for doing things step by step (though not necessarily with a huge amount of time between each step).
On the relatively less risky side, we have things such as:
Using the AGI-system in ways where it receives data, but where we don’t e.g. implement AGI-based designs (e.g. giving the AGI-system lots of data from various sensors and sources, and seeing if it can infer the placement of all the world’s nuclear submarines with a high level of confidence)
Letting it design circuits for circuit boards, and produce proofs for why the code run on these circuit-boards will be behaviorally equivalent to more high-level and less computationally efficient implementations of certain programs/subroutines (but not yet letting it design computer components “from scratch”)
Letting it help with tasks related to surveillance (inferring stuff based on data that it receives), but only giving it data-dumps from the internet (not giving it access to the internet, and not using programs it has designed to do hacking)
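The “behaviorally equivalent” idea in the circuit-design bullet can be illustrated with a much weaker stand-in for actual proofs: randomized differential testing of two implementations against each other. This is a minimal sketch, with hypothetical function names; real assurances would come from machine-checkable proofs, not from sampling.

```python
import random

# Trusted, straightforward reference implementation of some subroutine.
def popcount_reference(x: int) -> int:
    return bin(x).count("1")

# Hypothetical "optimized" implementation whose equivalence we want to check
# (standing in for code that would run on AGI-designed circuitry).
def popcount_optimized(x: int) -> int:
    count = 0
    while x:
        x &= x - 1  # clear the lowest set bit
        count += 1
    return count

# Randomized differential testing: far weaker than a proof, but it shows
# the shape of the question "are these two behaviorally equivalent?"
def check_equivalence(trials: int = 10_000) -> bool:
    rng = random.Random(0)
    for _ in range(trials):
        x = rng.getrandbits(64)
        if popcount_reference(x) != popcount_optimized(x):
            return False
    return True
```

A proof would quantify over all inputs rather than sampling them, but the object being certified is the same: agreement between a simple high-level implementation and a more efficient one.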
But at some point we will want/need to use superintelligent AGI-systems in ways that are more risky than this. Maybe first to e.g. make more efficient computers, and later to reduce/prevent risks from possibly less scrupulous actors that may also be developing superintelligent AGI-systems.
Molecular nanotechnology / atomically precise manufacturing, or maybe extremely advanced biotechnology, are examples of types of technologies that could provide a decisive technological advantage (especially in FOOM-like scenarios).
Clearly it could be quite risky to follow instruction-steps for advanced nanotechnological systems from a superintelligent AGI! I hope that before we make any use of superintelligent AGI-systems outside of the digital realm (or on the internet) we will first have been able to:
Generate a “council” of “siloed” AGI-systems that are based in different alignment-methodologies/architectures/principles (so we can see if they have converging output for similar questions/requests, and see if they can see flaws in answers/proofs/arguments/implementations from other AGI-systems)
Gain the capability of making use of computable proofs in a more extensive and efficient way than today, and of making very extensive and “holistic” use of such “proofs” (I’ll write more about that in parts 4 and 5 of this series)
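The “council” idea above can be sketched very crudely: query several independently developed, siloed systems and only act on an answer if enough of them converge on it. Everything here is a toy stand-in (each “system” is just a function), but it shows the acceptance rule being proposed.

```python
from collections import Counter

# Hypothetical stand-ins for independently developed, siloed AGI-systems.
# Here each "system" is just a function from a question to an answer.
def council_verdict(systems, question, min_agreement):
    """Accept an answer only if enough independent systems converge on it."""
    answers = [system(question) for system in systems]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes >= min_agreement:
        return answer
    return None  # no sufficiently converging output -> don't act on it

# Toy usage: three "systems" that agree on one question but not another.
systems = [
    lambda q: q.upper(),
    lambda q: q.upper(),
    lambda q: q.upper() if q != "edge" else "?",
]
```

A real version would also route each system’s answers/proofs to the other systems for adversarial review, not just count votes.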
This quote, which is about brain emulations, is also relevant to STVs (verifiable/transparent programs) that predict/describe how construction-steps and machines would behave/unfold in the real world:
One dilemma in regards to simulations is how fine-grained they are, and how they handle a model where the details are unclear (they simulate something that exists in the real world, but they are not given precise and accurate data of starting conditions). This is not just a dilemma for brain simulations, but for simulations of any physical system. Something we may want is a system that gives an accurate description of the range of possible/probable outcomes, given a description of the range of possible/probable starting conditions. And we want the possibility for the simulation to not spend lots of computation on details we don’t care about (only computing details that are useful, or that are helpful for verification of simulation). Since these are general-purpose challenges, which aren’t specific to brain simulations, we may want to have STVs that can help generate “simulation-approximations” for any physical system. That way we can also test if they do a consistently accurate job when used to make predictions about other physical systems (and not only be tested for whether or not they do a good job with brain emulations).
In other words, we want “simulators”, but not “simulators” in the classical sense of a program that starts out with a specific physical model and simulates some specific ways that things could unfold (using e.g. Monte Carlo simulations). What we want is a program that gets descriptions of a physical system (where some things are specified and others not), and infers what can be inferred regarding how that physical system would unfold (with some things being specified and others not). And this program should be general enough that it can be tested against existing real-world data from various physical systems. How specific the predictions of such programs would be depends on the specificity of the input.
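A crude sketch of the range-in, range-out idea: instead of simulating one trajectory from one starting condition, sample over the whole space of possible/probable starting conditions and report the range of outcomes. The physical model here (drag-free projectile) is just a toy stand-in for whatever system is being described.

```python
import math
import random

# A minimal sketch of a "simulator" that takes ranges of starting conditions
# and returns a range of outcomes, rather than a single trajectory.
def outcome_range(v0_range, angle_range, samples=10_000, g=9.81):
    """Return (min, max) horizontal distance over the space of starting conditions.

    v0_range:    (low, high) initial speed in m/s, only known as a range
    angle_range: (low, high) launch angle in radians, also only a range
    """
    rng = random.Random(0)
    distances = []
    for _ in range(samples):
        v0 = rng.uniform(*v0_range)
        angle = rng.uniform(*angle_range)
        # Standard drag-free projectile range formula.
        distances.append(v0 ** 2 * math.sin(2 * angle) / g)
    return min(distances), max(distances)

lo, hi = outcome_range((9.0, 11.0), (0.7, 0.8))
```

Sampling like this can miss narrow “weird” regions of the starting-condition space, which is exactly why the text asks for something closer to verified bounds than to Monte Carlo estimates.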
If, for example, a nanosystem would behave in some “weird” (but “non-random”) way under specific circumstances, the system should be able to detect that if those specific circumstances are within the space of possible/probable starting conditions. And if it wouldn’t be able to detect it, it should be able to make clear what kinds of “weird” (but “non-random”) behaviors it’s unable to rule out.
In regards to designs (and construction processes) for the first nanosystems, the more “multiple assurances” these can have, the better. By “multiple assurances” I mean that we could be wrong about some things, but even if we are, that doesn’t stop the system from working as we want it to (and even if in certain ways it doesn’t work as we want it to, it still doesn’t behave in ways that are catastrophic).
I am reminded of these excerpts from Eric Drexler’s book Radical Abundance:
Thus, in comparing problems in engineering and science, one often encounters quite different questions and objects of study, and what counts as a good-enough calculation typically differs as well—and indeed can differ enormously. Here, engineers have a luxury. Within broad limits they can choose both their systems and their questions to fit within the bounds of knowledge and calculation. This freedom of choice can help engineers design artifacts that will perform as expected—provided, of course, that the unknown is somehow held at bay.
(...)
The world’s messy complexity often gives engineers reason to shield their creations inside a box of some sort, like a computer or printer in a case, or an engine under the hood of a car. A box provides insulation against the complex, uncontrolled, and unpredictable world of nature (and children).
The external world still causes problems at boundaries, of course. Consider a ship, or a pacemaker. Ship hulls keep out the sea, but the sea attaches barnacles; engineers hide a pacemaker’s electronics inside a shell, but the body’s immune system and tissue remodeling make medical implant design far more than an exercise in physics. This kind of problem doesn’t arise for an APM system protected by a box on a table, which is sheltered in turn inside the box-like walls of a house.
Engineers can solve many problems and simplify others by designing systems shielded by barriers that hold an unpredictable world at bay. In effect, boxes make physics more predictive and, by the same token, thinking in terms of devices sheltered in boxes can open longer sightlines across the landscape of technological potential. In my work, for example, an early step in analyzing APM systems was to explore ways of keeping interior working spaces clean, and hence simple.
Note that designed-in complexity poses a different and more tractable kind of problem than problems of the sort that scientists study. Nature confronts us with complexity of wildly differing kinds and cares nothing for our ability to understand any of it. Technology, by contrast, embodies understanding from its very inception, and the complexity of human-made artifacts can be carefully structured for human comprehension, sometimes with substantial success.
(...)
Many systems amplify small differences this way, and chaotic, turbulent flow provides a good example. Downstream turbulence is sensitive to the smallest upstream changes, which is why the flap of a butterfly’s wing, or the wave of your hand, will change the number and track of the storms in every future hurricane season.
Engineers, however, can constrain and master this sort of unpredictability. A pipe carrying turbulent water is unpredictable inside (despite being like a shielded box), yet can deliver water reliably through a faucet downstream. The details of this turbulent flow are beyond prediction, yet everything about the flow is bounded in magnitude, and in a robust engineering design the unpredictable details won’t matter.
Likewise, in APM systems, thermal fluctuations are unpredictable in detail, but bounded in magnitude, and a discipline of robust design once again can give satisfactory results.
(...)
Engineers work with knowledge of limited scope and precision in much the same way that they work with motors of limited power and materials of limited strength. Within specific constraints of knowledge, some tasks can be achieved predictably, while others cannot.
Coping with limited knowledge is a necessary part of design and can often be managed. Indeed, engineers designed bridges long before anyone could calculate stresses and strains, which is to say, they learned to succeed without knowledge that seems essential today. In this light, it’s worth considering not only the extent and precision of scientific knowledge, but also how far engineering can reach with knowledge that remains incomplete and imperfect.
(...)
In talking with the director of a major nanotechnology research center (a leader in his field), I mentioned modeling mechanical nanosystems using standard molecular mechanics methods for computational molecular dynamics. He objected that these methods are “inaccurate.” I agreed that they have too little accuracy for some purposes (for example, some aspects of biomolecular modeling), but noted that they have more than enough accuracy for others, and that the systems I’d studied (rigid, stable gears and bearings, for example) were of the latter sort. He paused, looked thoughtful, and we moved on to another subject.
The first instruction-steps for implementing nanomachines should be carried out in conditions where we can measure the construction-process in various ways, and where the construction-process somehow is enclosed (in ways that make self-replication beyond the enclosure impossible, even if we are wrong about many details having to do with the nanomachinery). Maybe the construction-process should be slow-moving by chemical necessity, with the possibility of quickly and automatically turning everything inside the enclosure into plasma if anything unfolds contrary to expectations.
The AGI-systems/STVs that propose the construction-steps/machinery should not be informed about the enclosure and the measurement. The designs for the construction-steps/machinery should be of such a kind that they can be expected to work robustly in a very wide range of conditions (making it unnecessary to inform them about the enclosure).
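The monitoring-plus-failsafe logic being described can be sketched as a simple check: every measurement must stay inside a pre-committed expected envelope, and any divergence triggers the failsafe (in the text, rapidly turning everything inside the enclosure into plasma). The sensor names and values below are made up for illustration.

```python
# A sketch of the monitoring logic: every reading must stay inside a
# pre-committed expected envelope; any divergence triggers the failsafe.
def monitor_step(readings, envelopes, failsafe):
    """readings: dict sensor -> value; envelopes: dict sensor -> (lo, hi)."""
    for sensor, value in readings.items():
        lo, hi = envelopes[sensor]
        if not (lo <= value <= hi):
            failsafe(sensor, value)
            return False  # construction halted
    return True  # everything within expectations; continue

# Toy usage with hypothetical sensors: pressure is outside its envelope,
# so the failsafe fires and the step reports a halt.
triggered = []
ok = monitor_step(
    {"temperature_K": 310.0, "pressure_Pa": 2.0e5},
    {"temperature_K": (270.0, 320.0), "pressure_Pa": (0.5e5, 1.5e5)},
    failsafe=lambda sensor, value: triggered.append(sensor),
)
```

The hard part, as the questions further down note, is not this loop but ensuring the readings themselves can’t be fooled and that the sensors don’t compromise the enclosure.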
I’m very much not a physicist or a chemist, so I don’t know how an enclosure could be constructed in such a way as to maximize assurances that no nanosystems/nanocomponents escape it. Vague ideas that come to mind involve:
Thick multi-layered walls of different materials, making it chemically impossible for a reaction to dig its way through (given the amount and type of materials used in the experiment/construction)
Thick multi-layered walls kept at very cold temperatures (maybe close to absolute zero), making it energetically impossible to escape the enclosure
Magnetics + vacuum
Vacuum of space (e.g. running certain experiments inside a specially constructed satellite, or some heavily enclosed box that is sent slowly into the blackness of space)
An enclosure whose internals will be turned to plasma shortly after the experiment has been run
Such strategies would of course be advantageous to have planned out and prepared for in advance to some degree (rather than improvised on the fly).
Part of the challenge is:
How do we ensure that it’s infeasible for the construction-process/machine/component to “fool” the measurement/surveillance-system (even if the procedure diverges from expectations)?
How do we ensure that the measurement/surveillance-system doesn’t compromise the enclosure?