… we don’t merely want a precise theory that lets us build an agent; we want our theory to act like a box that takes in an arbitrary agent (such as one built using ML and other black boxes) and allows us to analyze its behavior.
FWIW, this is what I consider myself to be mainly working towards, and I do expect that the problem is directly solvable. I don’t think that’s a necessary case to make in order for HRAD-style research to be far and away the highest priority for AI safety (so it’s not necessarily a crux), but I do think it’s both sufficient and true.