If you can locally inspect cognitive steps for properties that globally add to intelligence, corrigibility, and alignment, you’re done; you’ve solved the AGI alignment problem and you can just apply the same knowledge to directly build an aligned corrigible intelligence.
I agree with the first part of this. The second part isn't really true, because the resulting AI might be very inefficient (e.g., suppose you could tell which cognitive strategies are safe but not which are effective).
Overall, I don't think it's likely to be useful to talk about this topic until we have much more clarity on other stuff (I think this section is responding to a misreading of my proposal).
This stuff about inspecting thoughts fits into the picture when you say, "But even if you are willing to spend a ton of time looking at a particular decision, how could you tell if it was optimized to cause a catastrophic failure?", and I say, "if the AI has learned how to cause a catastrophic failure, we can hope to set up the oversight process so it's not that much harder to explain how it's causing a catastrophic failure", and then you say, "I doubt it", and I say, "well, that's the hope; it's complicated", and then we discuss whether that problem is actually soluble.
And that does have a bunch of hard steps, especially the one where we need to be able to open up some complex model of the world that our AI has formed, in order to justify a claim about why some action is catastrophic.