An observation: some sequences of actions change the world state very little, yet expand the set of actions available to the system.
For example, if the system starts with a set of actions that lets it control an internet-connected web browser, it can use those actions to write and run a program in a browser-based IDE like Replit. Writing the program has little effect on the world state (it modifies a few kB on some hard disks in Replit’s servers), but once the program is written, the system has a new action available: run the program. Many other kinds of “action-expanding” sequences are possible.
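To make the observation concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the `State` class, the action names, and the crude `state_distance` measure are hypothetical and stand in for whatever notion of world state and change one actually adopts.

```python
from dataclasses import dataclass

# Hypothetical starting actions for a browser-controlling system.
BASE_ACTIONS = frozenset({"browse", "write_program"})


@dataclass(frozen=True)
class State:
    # Toy world state: just the names of programs stored on some server.
    files: frozenset = frozenset()


def available_actions(state: State) -> frozenset:
    # Each stored program contributes one extra "run:<name>" action.
    return BASE_ACTIONS | {f"run:{name}" for name in state.files}


def step(state: State, action: str) -> State:
    # Only "write_program" changes this toy state; it stores one file.
    if action == "write_program":
        return State(files=state.files | {"prog"})
    return state


def state_distance(a: State, b: State) -> int:
    # Crude change measure (an assumption): number of files that differ.
    return len(a.files ^ b.files)


s0 = State()
s1 = step(s0, "write_program")
new_actions = available_actions(s1) - available_actions(s0)
```

In this toy model the two states differ by a single stored file, yet the action set at `s1` contains an action that did not exist at `s0`. A measure of impact that looks only at `state_distance` would miss the expansion.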
By regarding world states that are similar enough to each other as equivalent, the tree becomes a graph. Can any principles of corrigibility be reformulated strictly in terms of mathematical properties of this graph and the set of actions available at each node?
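The collapse from tree to graph can be sketched as follows. This is an illustrative toy under stated assumptions, not a proposed formalization: the world state is a single integer, the two actions `inc` and `dec` are hypothetical, and "similar enough" is modeled by bucketing states with a `tolerance` width.

```python
from itertools import product

# Hypothetical toy actions: each one shifts an integer world state.
ACTIONS = {"inc": 1, "dec": -1}


def apply_history(history):
    # World state reached from 0 after a sequence of actions.
    return sum(ACTIONS[a] for a in history)


def equivalence_class(state, tolerance=1):
    # Assumed similarity rule: states in the same bucket are equivalent.
    return state // tolerance


def build_tree_and_graph(depth, tolerance=1):
    # Tree nodes are action histories; every distinct history is a node.
    tree_nodes = [h for d in range(depth + 1)
                  for h in product(ACTIONS, repeat=d)]
    graph_nodes = set()
    graph_edges = set()
    for h in tree_nodes:
        node = equivalence_class(apply_history(h), tolerance)
        graph_nodes.add(node)
        if h:  # the root history has no incoming edge
            parent = equivalence_class(apply_history(h[:-1]), tolerance)
            graph_edges.add((parent, h[-1], node))
    return tree_nodes, graph_nodes, graph_edges


tree_nodes, graph_nodes, graph_edges = build_tree_and_graph(depth=2)
```

With `depth=2` the tree has 7 nodes, while the graph has only 5, because histories such as `("inc", "dec")` and `("dec", "inc")` land in the same class as the root. The graph also contains cycles (0 to 1 and back), which the tree cannot represent. Attaching a set of available actions to each graph node would then give the structure the question above asks about.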