Really? Isn’t editing one’s goal directly contrary to one’s goal? If an AI self-edits in such a way that its goal changes, it will predictably no longer be working towards that goal, and will thus not consider it a good idea to edit its goal.
It depends on how it decides whether or not changes are a good thing. If it is trying out two utility functions, Ub for utility before and Ua for utility after, you need to be careful to ensure it doesn’t say “hey, Ua(x)>Ub(x), so I can make myself better off by switching to Ua!”.
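As a rough illustration of the Ua/Ub comparison, here is a minimal Python sketch. Everything in it is hypothetical: the outcome names, the numbers, and the helpers `best_outcome`, `naive_should_switch`, and `stable_should_switch` are made up for the example, and it assumes x means "the outcome the agent would steer towards after the switch".

```python
from typing import Callable, Sequence

Outcome = str
Utility = Callable[[Outcome], float]

# Hypothetical toy world: three possible outcomes and two candidate goals.
OUTCOMES: Sequence[Outcome] = ["paperclips", "staples", "nothing"]

def u_before(x: Outcome) -> float:
    # Ub: the utility function the agent has right now.
    return {"paperclips": 10.0, "staples": 2.0, "nothing": 0.0}[x]

def u_after(x: Outcome) -> float:
    # Ua: the utility function it is considering switching to.
    return {"paperclips": 1.0, "staples": 100.0, "nothing": 0.0}[x]

def best_outcome(u: Utility, outcomes: Sequence[Outcome]) -> Outcome:
    # The outcome an agent maximizing u would steer towards.
    return max(outcomes, key=u)

def naive_should_switch(ub: Utility, ua: Utility,
                        outcomes: Sequence[Outcome]) -> bool:
    # Flawed check: compares scores across two different utility functions,
    # i.e. the "Ua(x) > Ub(x), so switching makes me better off" mistake.
    x = best_outcome(ua, outcomes)
    return ua(x) > ub(x)

def stable_should_switch(ub: Utility, ua: Utility,
                         outcomes: Sequence[Outcome]) -> bool:
    # Judges the switch by the current goal only: would the outcome my
    # successor reaches score higher under Ub than the one I would reach?
    x_if_switch = best_outcome(ua, outcomes)
    x_if_stay = best_outcome(ub, outcomes)
    return ub(x_if_switch) > ub(x_if_stay)

print(naive_should_switch(u_before, u_after, OUTCOMES))
print(stable_should_switch(u_before, u_after, OUTCOMES))
```

With these made-up numbers the naive check approves the switch (Ua rates "staples" at 100 while Ub rates it at 2), whereas the check that scores both futures with Ub rejects it, since Ub prefers "paperclips" at 10 to "staples" at 2.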
Ensuring that doesn’t happen is not simple, because it requires stability throughout everything. There can’t be a section that decides to try being goalless, or go about resolving the goal in a different way (which is troublesome if you want it to cleverly use instrumental goals).
[edit] To be clearer, it is not enough for the goals to be fixed and well-understood; every other part of the system also needs to have a fixed and well-understood relationship to the goals (and a fixed and well-understood sense of understanding, and …). Most attempts to rewrite source code are not that well-planned.