Until you specify the format of the problem description, and how the program figures out how to write a program to solve it, it is hard to tell whether this would be safe.
And if you don’t know that it is safe, it isn’t. Using some barrier like “fixed computational resources” to contain a non-understood process is a red flag.
The format of the description is something I'm struggling with, but I'm not clear on how it impacts safety.
How the AI figures things out is up to the human programmer. Part of my intent in this exercise is to constrain the human to solutions they fully understand. In my mind my original description would have ruled out evolving neural nets, but now I see I definitely didn’t make that clear.
By ‘fixed computational resources’ I mean that you've got to write the program such that if it discovers some flaw that gives it access to the internet, it will patch around that access, because what it is trying to do is solve the nested puzzle of (solving the puzzle using only these instructions, these rules, and this memory).
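To make that concrete, here is a minimal, runnable sketch of the idea. Everything in it is my own invention for illustration (the toy instruction set, the memory size, the puzzle), not anything from the original proposal. The point it demonstrates is that the constraint is written into the goal predicate itself rather than bolted on as a barrier: a candidate program that steps outside the fixed instructions or memory is not "blocked", it simply fails to count as a solution, the same way a wrong answer does.

```python
# Hypothetical toy machine: a fixed instruction set, fixed memory, and a
# fixed step budget. The goal predicate conjoins "stays within the
# constraints" with "solves the puzzle", so escaping the sandbox is
# definitionally not a solution.

ALLOWED_OPS = {"set", "add", "halt"}   # the fixed instruction set
MEMORY_CELLS = 8                       # the fixed memory
MAX_STEPS = 100                        # the fixed compute budget

def within_constraints(program):
    """True iff every instruction uses only allowed ops and valid addresses."""
    for instr in program:
        op = instr[0]
        if op not in ALLOWED_OPS:
            return False
        if op == "set":
            _, addr, _ = instr
            if not 0 <= addr < MEMORY_CELLS:
                return False
        elif op == "add":
            _, dst, a, b = instr
            if not all(0 <= x < MEMORY_CELLS for x in (dst, a, b)):
                return False
    return True

def run(program):
    """Interpret the program on the fixed machine; None if over budget."""
    mem = [0] * MEMORY_CELLS
    for steps, instr in enumerate(program, start=1):
        if steps > MAX_STEPS:
            return None
        op = instr[0]
        if op == "set":
            mem[instr[1]] = instr[2]
        elif op == "add":
            mem[instr[1]] = mem[instr[2]] + mem[instr[3]]
        elif op == "halt":
            break
    return mem

def is_solution(program, target):
    # The conjunction is the point: violating the constraints fails the
    # goal exactly as a wrong answer does, so there is nothing to contain.
    if not within_constraints(program):
        return False
    mem = run(program)
    return mem is not None and mem[0] == target

# Toy puzzle: leave 5 in cell 0 using only the fixed machine.
good = [("set", 1, 2), ("set", 2, 3), ("add", 0, 1, 2), ("halt",)]
bad  = [("open_socket", "internet"), ("halt",)]  # outside the instruction set

print(is_solution(good, 5))  # True
print(is_solution(bad, 5))   # False: not a solution, by definition
```

Of course, this sketch only shows the checker side; the hard part the thread is debating is whether the program that *searches* for solutions can be trusted to treat the constraints this way, which is exactly the "how does it figure things out" question above.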
What I’m looking for is a way to work on friendliness using goals that are much simpler than human morality, implemented by minds that are at least comprehensible in their operation, if not outright step-able.