I want to add to the general agreement here: I broadly agree with this strategy and think that in practice this kind of thing (multiple alignment solutions strapped onto a system in parallel) is what alignment success will look like, and have written about this before. This agreement is primarily with the swiss-cheese/assemblage strategy.
It is important to note that I think hodgepodge alignment strategies only really apply to the near-term proto-AGI systems we will build, and won't work against a system that has strongly recursively self-improved into a true superintelligence (insofar as this is possible). However, a key thing we need to figure out is how to prevent this from happening to our near-term AGI systems in an uncontrolled way, and I am hopeful that layering multiple methods can help substantially here.
I am also somewhat sceptical of using type-theoretic/category-theoretic approaches to assemblages as a means of proving safety guarantees (vs as a thinking tool, which I strongly think can be helpful). This is basically because it is unclear to me whether types are sufficiently strong objects to let us reason about properties like safety (i.e. even in strongly typed languages the type-checker can tell you at compile time whether the function will output an int, but cannot guarantee that the int will be < 9000, and I think something like the latter is the kind of guarantee we need for alignment).
i.e. even in strongly typed languages the type-checker can tell you at compile time whether the function will output an int, but cannot guarantee that the int will be < 9000, and I think something like the latter is the kind of guarantee we need for alignment
Define a new type that only allows ints < 9000 if needed?
As for your broader question, I think there might be some safety-relevant properties of some alignment primitives.
Beren, have you heard of dependent types, which are used in Coq, Agda, and Lean? (I don’t mean to be flippant; your parenthetical just gives the impression that you hadn’t come across them, because they can easily enforce integer bounds, for instance.)
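To make the dependent-types point concrete, here is a minimal sketch in Lean 4 using its built-in `Fin` type, whose values carry a proof of an upper bound (the `clamp` function and the 9000 bound are just illustrative, taken from the example above):

```lean
-- `Fin 9000` is the type of natural numbers n together with a proof that n < 9000.
-- The anonymous constructor ⟨value, proof⟩ packages the number with its bound proof.
def bounded : Fin 9000 := ⟨8999, by decide⟩

-- A function whose *return type* itself guarantees the output is < 9000.
-- The dependent `if h : n < 9000` makes the proof `h` available in the then-branch.
def clamp (n : Nat) : Fin 9000 :=
  if h : n < 9000 then ⟨n, h⟩ else ⟨8999, by decide⟩

#eval (clamp 12345).val  -- 8999
#eval (clamp 42).val     -- 42
```

The type-checker rejects any definition of type `Fin 9000` that cannot supply the bound proof, so "outputs an int < 9000" becomes a compile-time guarantee rather than a runtime hope. Whether properties as rich as "safe" can be captured this way is, of course, the open question in the parent comment.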