I think that an important part of this is ‘agent foundations’, by which I broadly mean a theory of what agents should look like, and what structural facts about agents could cause them to display undesired behaviour. [emphasis Rohin’s]
Huh? Surely if you’re trying to understand the agents that arise from training, you should have a theory of arbitrary agents rather than just ideal agents.
You’re right that you don’t just want a theory of ideal agents. But I think it’s sufficient to have a theory only of very good agents, and to discard any trained systems that aren’t very good agents. This is more true the more optimistic you are about ML producing very good agents.