A counterfactual and hypothetical note on AI safety design
A putative new idea for AI control; index here.
A lot of the new ideas I’ve been posting could be parodied as going something like this:
The AI A, which is utility indifferent to the existence of AI B, has utility u (later corrected to v’, twice), and it will create a subagent C which believes, via false thermodynamic miracles, that D does not exist, while D’ will hypothetically and counterfactually use two different definitions of counterfactual so that the information content of its own utility cannot be traded with a resource-gathering agent E that doesn’t exist (assumed separate from its unknown utility function)...
What is happening is that I’m attempting to define algorithms that accomplish a particular goal (such as obeying the spirit of a restriction, or creating a satisficer). Typically these algorithms have various underdefined components, such as inserting an intelligent agent at a particular point, controlling the motivation of an agent at a point, effectively defining a physical event, or having an agent believe (or act as if it believed) something that was incorrect.
The aim is to reduce the problem from stuff like “define human happiness” to stuff like “define counterfactuals” or “pinpoint an AI’s motivation”. These problems should be fundamentally easier—if not for general agents, then for some of the ones we can define ourselves (this may also allow us to prioritise research directions).
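To make the pattern concrete, here is a minimal and purely illustrative Python sketch (every name in it is hypothetical, not taken from any specific proposal): the overall scheme is written out in full, while the genuinely hard subproblems are isolated as stub functions that further research would have to fill in.

```python
# Illustrative only: a hypothetical "design with underdefined components".
# The wrapper is fully specified; the open research problems are the stubs.

class UnderdefinedComponent(NotImplementedError):
    """Marks a subproblem that the design deliberately leaves open."""


def define_counterfactual(event):
    """Open subproblem: crisply define 'what would have happened
    had this event not occurred'."""
    raise UnderdefinedComponent(f"counterfactual over {event!r}")


def pinpoint_motivation(agent):
    """Open subproblem: locate (and control) the agent's motivation
    at a given point in its operation."""
    raise UnderdefinedComponent(f"motivation of {agent!r}")


def restricted_agent(agent, restriction_event):
    """The 'easy' part: a fully specified scheme for obeying the spirit
    of a restriction, *given* solutions to the stubs above."""
    baseline = define_counterfactual(restriction_event)
    motivation = pinpoint_motivation(agent)
    # ...combine the counterfactual baseline and the pinpointed
    # motivation into a corrected policy for the agent...
    return baseline, motivation
```

The point of the sketch is only that the design reduces one big, vague problem to a small number of named, narrower ones; everything outside the stubs is already pinned down.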
And I have no doubt that, once a design is available, it will be improved upon, transformed, and made easier to implement and generalise. Therefore I’m currently more interested in objections of the form “it won’t work” than “it can’t be done”.