“we’re going to design through proof the most advanced and complex program that ever existed, orders of magnitudes more complex than the most complex current systems”.
I disagree that AI code is orders of magnitude more complex than, say, the code in a web browser or modern compiler: in fact quite the opposite applies. Most modern ML algorithms are very short pieces of code. If you are willing to use somewhat abstract math where you do not write out all the hyperparameter values, you can specify everything that goes on in a deep learning algorithm in only a few lines of pseudocode. The same goes for most modern RL algorithms.
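To make this concrete, here is a minimal sketch of the core of supervised deep learning. I use PyTorch below; the network shape, learning rate, and the random stand-in data are placeholder choices of mine, not taken from any real system.

```python
# Minimal sketch: the core loop of supervised deep learning in a few lines.
# Network shape, learning rate, and the random stand-in data are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(1000):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in for a real dataset
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()   # backpropagation: compute gradients of the loss
    opt.step()        # gradient descent: update the weights
```

Almost everything else in a large ML codebase is engineering around this core, rather than part of the learning algorithm itself.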
I also note that while modern airplanes contain millions of lines of code, most of it is in the on-board entertainment systems. The safety-critical subsystems in airplanes tend to be significantly smaller in code size, and they also run air-gapped from the on-board entertainment system code. This air-gapping of course plays an important role in making the formal proofs for the safety-critical subsystems possible.
But beyond these observations: the main point I am trying to get across is that I do not value formal methods as a post-hoc verification tool that should be applied to millions of lines of code, some of it spaghetti code put together via trial and error. Clearly that approach would not work very well.
I value formal methods as a tool for the aligned AI code specification and design phases.
On the design phase: formal methods offer me a way to create many alternative and clarifying viewpoints on what the code is doing or intending to do, viewpoints not offered by the text of the code itself. For example, formally expressed loop invariants can express much more clearly what is going on in a loop than the loop code itself. Global invariants that I can formulate over distributed system state allow me to express, more clearly than the code itself, how the protocols I design manage to avoid deadlock. So the main value of the hand-in-hand method, as a code design method, is that you can develop these clarifying mathematical viewpoints, even before you start writing the code. (Not all programmers are necessarily good at using formal methods to clarify what they are doing, or to accelerate their work. I have heard a formal tool vendor say that formal methods based tools only make the top 10% of programmers in any development organisation more productive.)
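As a toy illustration of the loop invariant point (my own example, not taken from any real verification project):

```python
def int_sqrt(n: int) -> int:
    """Largest r with r*r <= n, for n >= 0."""
    r = 0
    # Loop invariant (what the loop maintains, a viewpoint the code alone
    # does not spell out):  0 <= r  and  r*r <= n.
    # On exit we additionally know (r+1)*(r+1) > n, which together with the
    # invariant is exactly the specification of the result.
    while (r + 1) * (r + 1) <= n:
        r += 1
        assert r * r <= n  # runtime check of the invariant, for illustration
    return r
```

The invariant plus the negated loop condition together give the specification of the result; that is the kind of clarifying mathematical viewpoint I mean.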
On the specification phase: formal methods allow me to express the safety properties I seek for AGI systems in a much clearer and more precise way than any piece of code or any natural language statement can.
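As a toy illustration (my own, not a property taken from any existing paper): a shutdown requirement can be written in linear temporal logic as

$$\Box\,(\mathit{button\_pressed} \rightarrow \Diamond\,\mathit{agent\_halted})$$

that is, at every point in time, if the button is pressed then the agent eventually halts. That single line is more precise than a paragraph of natural language, and it constrains the design without saying anything about how the design is implemented.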
If we had such formal methods backed approaches, I could see DeepMind potentially using it, maybe OpenAI, but not chinese labs or all the competitors we don’t know about. And the fact that it would slow down whoever uses it for 2 years would be a strong incentive for even the aligned places to not use it
No, the design flow here is that alignment researchers spend these 2 years of time using formal methods to develop and publish AGI safety mechanisms right now, before anybody invents viable AGI. The designs we develop should then preferably take far less than 2 years, and no deep mathematical skills, to combine with an AGI-level ML algorithm, if anyone ever invents one.
OK, I really need to post the promised list of alignment papers where formal methods are actually used in the clarifying design-tool way I describe above. Talking about it in the abstract is probably not the best way to get my points across. (I am in the middle of attending virtual NeurIPS, so I have not found the time yet to compile the promised list of papers.)
BTW, bit of a side issue: I do not have an x-risk fault tree model where I assume that a ‘chinese lab’ is automatically going to be less careful than a ‘western lab’. But as a rhetorical shorthand, I see what you are saying.
OK, here is the promised list of formal methods based work which has advanced the field of AGI safety. These are specific examples to back up my earlier meta-level remarks where I said that formal methods are and have been useful for AGI safety.
To go back to the Wikipedia quote:
The use of formal methods for software and hardware design is motivated by the expectation that, as in other engineering disciplines, performing appropriate mathematical analysis can contribute to the reliability and robustness of a design.
There are plenty of people in CS who are inspired by this thought: if
we can design safer bridges by using the tool of mathematical
analysis, why not software?
First, some background comments about formal methods in ML
In modern ML, there are a lot of people who use mathematical analysis in an attempt to better understand or characterize the performance of ML algorithms. Such people have in fact been around since the start of AI as a field.
I am attending NeurIPS 2021, and there are lots of papers which are
all about this use of formal methods, or have sections which are all
about this. To pick a random example, there is ‘Provable Guarantees
for Self-Supervised Deep Learning with Spectral Contrastive Loss’
(preprint here) where the abstract says
things like
Minimizing this objective leads to features with provable accuracy guarantees under linear probe evaluation. By standard generalization bounds, these accuracy guarantees also hold when minimizing the training contrastive loss.
(I have not actually read the paper. I am just using this as a random
example.)
In general, my sense is that in the ML community right now, there is a broad feeling that mathematical analysis still needs some breakthrough before it can say truly useful things, truly useful things when it comes to safety engineering, about AIs that use deep neural networks.
If you read current ML research papers on ‘provable guarantees’ and then get very unexcited, or even deeply upset at how many simplifying assumptions the authors are making in order to allow for any proofs at all, I will not blame you. My sense is also that different groups of ML researchers disagree about how likely such future mathematical breakthroughs are.
As there is plenty of academic and commercial interest in this formal
methods sub-field of ML, I am not currently working in it myself, and
I do not follow it closely. As an independent researcher, I have the
freedom to look elsewhere.
I am looking in a different formal methods corner, one where I
abstract away from the details of any specific ML algorithm. I then
try to create well-defined safety mechanisms which produce provable
safety properties, properties which apply no matter what ML algorithm
is used. This ties back into what I said earlier:
the design flow here is that alignment researchers spend these 2 years of time using formal methods to develop and publish AGI safety mechanisms right now, before anybody invents viable AGI.
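To make this concrete, here is a minimal sketch of the kind of thing I mean: a safety mechanism whose (here trivial) provable property holds no matter which ML algorithm sits behind the interface. The names Learner, propose_action and safe_step are hypothetical illustrations of my own, not a mechanism from any of the papers below.

```python
from typing import Any, Protocol

class Learner(Protocol):
    """Any ML algorithm that proposes actions; its internals are irrelevant here."""
    def propose_action(self, observation: Any) -> Any: ...

def safe_step(learner: Learner, observation: Any,
              allowed: set, fallback: Any) -> Any:
    # The safety mechanism is defined without reference to how the learner works.
    # Provable property (trivial here, by case analysis): if `fallback` is in
    # `allowed`, then the returned action is in `allowed`, for every possible
    # implementation of `learner`.
    action = learner.propose_action(observation)
    return action if action in allowed else fallback
```

The interesting mechanisms in the papers below are of course far less trivial than this, but the logical structure of the claims is the same: the proof quantifies over all possible learners.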
The list
So here is an annotated list of some formal methods based work which
has, in my view, advanced the field of AGI safety. (To save me some
time, I will not add any hyperlinks, pasting the title into google
usually works.)
Hutter’s AIXI model, ‘Universal algorithmic intelligence: A
mathematical top down approach’ models superintelligence in a
usefully universal way.
I need to point out however that the age-old π∗ optimal-policy
concept in MDPs also models superintelligence in a usefully
universal way, but at a higher level of abstraction.
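For reference, one standard way to write that concept, for an MDP with reward function R and discount factor γ, is

$$\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, \pi\right]$$

The definition says nothing about how such a policy could ever be computed or learned, which is exactly what makes it a useful, if very abstract, model of a maximally capable agent.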
MIRI’s 2015 paper ‘Corrigibility’ makes useful contributions to defining provable safety properties, though it then also reports failure in its attempt to design an agent that provably satisfies them. At a history-of-ideas level, what is interesting here is that 2015 MIRI seemed to have been very much into using formal methods to make progress on alignment, but 2021 MIRI seems to feel that there is much less promise to the approach. Or at least, going by recently posted discussions, 2021 Yudkowsky seems to have lost all hope that these methods will end up saving the world. My own opinion is much more like 2015 MIRI than it is like 2021 MIRI. Modulo of course the idea that people meditating on HPMOR is a uniquely promising approach to solving alignment. I have never much felt the appeal of that approach myself. I think you have argued recently that 2021 Yudkowsky has also become less enthusiastic about the idea.
The above 2015 MIRI paper owes a lot to Armstrong’s 2015 ‘Motivated value selection for artificial agents’, the first official academic-style paper about Armstrong’s indifference methods. This paper arguably gets a bit closer to describing an agent design that provably meets the safety properties in the MIRI paper. The paper gets closer if we squint just right and interpret Armstrong’s 2015 somewhat ambiguous conditional probability notation for defining counterfactuals in just the right way.
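Very roughly, and glossing over exactly the notational subtleties at issue, indifference methods give the agent a utility function of the form

$$U(h) \;=\; \begin{cases} U_N(h) & \text{if the button is never pressed in } h,\\ U_S(h) + \mathbb{E}[U_N \mid h_{<p}] - \mathbb{E}[U_S \mid h_{<p}] & \text{if the button is pressed at time } p, \end{cases}$$

where the correction term is chosen so that, just before the press, the agent’s expected utility is the same whether or not the button gets pressed, so it has no incentive to cause or prevent the press. How exactly the two conditional expectations are to be read is where the squinting comes in.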
My 2019 paper ‘Corrigibility with Utility Preservation’ in fact can
be read to show how you need to squint in just the right way if you
want to use Armstrong’s indifference methods to create provable
corrigibility related safety properties. The paper also clarifies a
lot of questions about agent self-knowledge and emerging
self-preservation incentives. Especially because of this second
theme, the level of math in this paper tends to be more than what
most people can or want to deal with.
In my 2020 ‘AGI Agent Safety by Iteratively Improving the Utility Function’ I use easier and more conventional MDP math to define a corrigible agent (for a particular definition of corrigibility), using Armstrong’s indifference methods as a basis. This paper also shows some results and design ideas going beyond stop buttons and indifference methods.
In my 2021 ‘Counterfactual Planning in AGI Systems’ I again expand
on some of these themes, developing a different mathematical lens
which I feel clarifies several much broader issues, and makes the
math more accessible.
Orseau and Armstrong’s 2016 ‘Safely interruptible agents’ is often the go-to cite when people want to cite Armstrong’s indifference methods. In fact it formulates and proves (for some ML systems) a related ‘safely interruptible’ safety property, a property which we also want to have if we apply a design like indifference methods.
More recent work which expands on safe interruptibility is the 2021 ‘How RL Agents Behave When Their Actions Are Modified’ by Langlois and Everitt.
Everitt and Hutter’s 2016 ‘Self-modification of policy and utility
function in rational agents’ also deals with agent self-knowledge
and emerging self-preservation incentives, but like my above 2019
paper which also deals with the topic, the math tends to be a bit
too much for many people.
Carey, Everitt, et al. 2021 ‘Agent Incentives: A Causal Perspective’
has (the latest version of) a definition of causal influence
diagrams and some safety properties we might define by using them.
The careful mathematical definitions provided in the paper use
Pearl’s style of defining causal graphs, which is again too much for
some people. I have tried to offer a more accessible introduction
and definitional framework
here.
Everitt et al’s 2019 ‘Reward Tampering Problems and Solutions in
Reinforcement Learning: A Causal Influence Diagram Perspective’
uses the tools defined above to take a broader look at various
safety mechanisms and their implied safety properties.
The 2020 ‘Asymptotically unambitious artificial general intelligence’ by Cohen et al designs a generic system where an uncertain-about-goals agent can call a mentor, and then proves some bounds for it.
I also need to mention the 2017 ‘The Off-Switch Game’ by
Hadfield-Menell et al. Like the 2015 MIRI paper on corrigibility,
this paper ends up noting that it has clarified some problems, but
without solving all of them.
There is lots of other work, but this should give you an idea of what I mean by formal methods work that defines useful AGI safety properties in a mathematically precise way.
Overall, given the results in the above work, I feel that formal methods are well underway in clarifying how one might define useful AGI safety properties, and how we might build them into AGI agents, if AGI agents are ever developed.
But it is also clear from recent comments like this
one
that Yudkowsky feels that no progress has been made. Yudkowsky is not
alone in having this opinion.
I feel quite the opposite, but there is only a limited amount of time
that I am willing to invest in convincing Yudkowsky, or any other
participants on this forum, that a lot of problems in alignment and
embeddedness are much more tractable than they seem to believe.
In fact, in the last 6 months or so, I have sometimes felt that the
intellectually interesting open purely mathematical problems have been
solved to such an extent that I need to expand the scope of my
research hobby. Sure, formal methods research on characterizing and
improving specific ML algorithms could still use some breakthroughs,
but this is not the kind of math that I am interested in working on as
a self-funded independent researcher. So I have recently become more
interested in advancing the policy parts of the alignment problem. As
a preview, see my paper
here, for which I
still need to write an alignment forum post.